André Platzer 
Geoff Sutcliffe (Eds.) 


Automated Deduction - 
CADE 28 


28th International Conference on Automated Deduction 
Virtual Event, July 12-15, 2021 
Proceedings 


LNAI 12699 


Springer


Lecture Notes in Artificial Intelligence 


Subseries of Lecture Notes in Computer Science 


Series Editors 


Randy Goebel 
University of Alberta, Edmonton, Canada 


Yuzuru Tanaka 
Hokkaido University, Sapporo, Japan 
Wolfgang Wahlster 


DFKI and Saarland University, Saarbrücken, Germany
Founding Editor 


Jörg Siekmann 
DFKI and Saarland University, Saarbrücken, Germany




More information about this subseries at http://www.springer.com/series/1244 




Editors 


André Platzer
Carnegie Mellon University
Pittsburgh, PA, USA

Geoff Sutcliffe
University of Miami
Coral Gables, FL, USA


ISSN 0302-9743 ISSN 1611-3349 (electronic) 
Lecture Notes in Artificial Intelligence 
ISBN 978-3-030-79875-8 ISBN 978-3-030-79876-5 (eBook) 


https://doi.org/10.1007/978-3-030-79876-5 
LNCS Sublibrary: SL7 — Artificial Intelligence 


© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication. 

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International 
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution 
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and 
the source, provide a link to the Creative Commons license and indicate if changes were made. 

The images or other third party material in this book are included in the book’s Creative Commons license, 
unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative 
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, 
you will need to obtain permission directly from the copyright holder. 

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication 
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant 
protective laws and regulations and therefore free for general use. 

The publisher, the authors and the editors are safe to assume that the advice and information in this book are 
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors 
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or 
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in 
published maps and institutional affiliations. 


This Springer imprint is published by the registered company Springer Nature Switzerland AG 
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland 


Preface 


This volume contains the proceedings of the 28th International Conference on Auto- 
mated Deduction (CADE-28). CADE is the major forum for the presentation of 
research in all aspects of automated deduction, including foundations, applications, 
implementations, and practical experience. CADE-28 was hosted by Carnegie Mellon 
University, Pittsburgh, USA, 11-16 July 2021, but held online due to the COVID-19 
pandemic. CADE-28 emphasized the breadth of topics that are of interest, including 
applications in and beyond STEM, and the use/contribution of automated deduction
in AI.

The Program Committee (PC) accepted 36 papers (29 full papers and 7 system 
descriptions) out of 76 submissions (59 full papers, 4 short papers, and 13 system 
descriptions). Each submission was reviewed by at least three Program Committee 
members or their external reviewers. The criteria for evaluation were originality and 
significance, technical quality, comparison with related work, quality of presentation, 
and reproducibility of experiments. 

The program of the conference included four invited talks: 


— Liron Cohen (Ben-Gurion University, Israel): “Non-well-founded Deduction for 
Induction and Coinduction” 

— Guido Governatori (CSIRO, Australia): “Computational Law: Automated Rea- 
soning in the Legal Domain” 

— Mooly Sagiv (Tel Aviv University, Israel): “Formal Reasoning about Decentralized 
Financial Applications” 

— Markus Rabe (Google, USA): “What are the Limits of Neural Networks for 
Automated Reasoning?” 


The conference hosted several workshops, tutorials, and competitions: 


— Workshop: 10th International Workshop on Theorem Proving Components for 
Educational Software. 

— Workshop: Proof eXchange for Theorem Proving. 

— Workshop: Parallel and Distributed Automated Reasoning. 

— Workshop: 17th International Workshop on Termination. 

— Workshop: Logical Frameworks and Meta-Languages - Theory and Practice. 

— Workshop: 3rd International Workshop on Automated Reasoning: Challenges, 
Applications, Directions, Exemplary Achievements. 

— Tutorial: Program Validation and Verification in PVS. Paolo Masci (NIA), Mariano 
Moscato (NIA), César Muñoz (NASA), Aaron Dutle (NASA), and Tanner Slagel
(NASA). 

— Tutorial: Practice of First-Order Reasoning. Stephan Schulz (DHBW), Adam Pease 
(Articulate Software), and Geoff Sutcliffe (University of Miami). 

— Tutorial: Learning to Prove: Machine Learning for Better SAT and QSAT Solvers. 
Sean Holden (University of Cambridge). 




– Tutorial: Proof-Theoretical Analysis of Non-Fregean Logic. Szymon Chlebowski,
Marta Gawek, Dorota Leszczyńska-Jasion, and Agata Tomczyk (Adam Mickiewicz
University).

– Competition: 28th CADE ATP System Competition. Geoff Sutcliffe (University of
Miami) and Martin Desharnais (Vrije Universiteit Amsterdam).

– Competition: Termination Competition 2021. Albert Rubio (UPC Barcelona) and
Akihisa Yamada (AIST Tsukuba).


In addition to the best paper awards, three CADE awards were presented at the
conference:


– The Herbrand Award for Distinguished Contributions to Automated Reasoning
(for 2020 and 2021).

– The Thoralf Skolem Awards for CADE papers that have passed the test of time by
being the most influential papers in the field, for papers from CADE-5 (1980),
CADE-11 (1992), CADE-17 (2000), and CADE-23 (2011).

– The (newly established) Bill McCune PhD Award for a PhD thesis’ substantive
contributions to the field of Automated Reasoning.


Thanks go to the many people without whom the conference would not have been
possible: the authors, participants, invited speakers, members of the PC and their
subreviewers, conference chairs, local organizers, the workshop/tutorial/competitions 
chair, the publicity chair, the CADE trustees, the board of the Association for Auto- 
mated Reasoning, the staff at Springer, and the EasyChair team. CADE-28 gratefully 
received support from the Automated Reasoning Group at Amazon Web Services, The 
Journal of Artificial Intelligence, Imandra Inc., and Springer.

Abstract. Induction and coinduction are both used extensively within
mathematics and computer science. Algebraic formulations of these prin- 
ciples make the duality between them apparent, but do not account 
well for the way they are commonly used in deduction. Generally, the 
formalization of these reasoning methods employs inference rules that 
express a general explicit (co)induction scheme. Non-well-founded proof 
theory provides an alternative, more robust approach for formalizing 
implicit (co)inductive reasoning. This approach has been extremely suc- 
cessful in recent years in supporting implicit inductive reasoning, but 
is not as well-developed in the context of coinductive reasoning. This 
paper reviews the general method of non-well-founded proofs, and puts 
forward a concrete natural framework for (co)inductive reasoning, based 
on (co)closure operators, that offers a concise framework in which induc- 
tive and coinductive reasoning are captured as we intuitively understand 
and use them. Through this framework we demonstrate the enormous 
potential of non-well-founded deduction, both in the foundational theoret- 
ical exploration of (co)inductive reasoning and in the provision of proof 
support for (co)inductive reasoning within (semi-)automated proof tools. 


1 Introduction 


The principle of induction is a key technique in mathematical reasoning that 
is widely used in computer science for reasoning about recursive data types 
(such as numbers or lists) and computations. Its dual principle, the principle
of coinduction [49,69,70], is not as widespread and has only been investigated
for a few decades, but still has many applications in computer science,
e.g., [42,56,39,52,82,55,57]. It is mainly used for reasoning about coinductive data
types (codata), which are data structures containing non-well-founded elements, 
e.g., infinite streams or trees. One prominent application of coinduction is as 
a generic formalism for reasoning about state-based dynamical systems, which 
typically contain some sort of circularity. It is key in proofs of the bisimula- 
tion of state-transition systems (i.e., proving that two systems are behaviorally 
equivalent) and is a primary method for reasoning about concurrent systems [53]. 

A duality between induction and coinduction is observed when formulating 
them within an algebraic, or categorical, framework, e.g., [71,64,70,69]. Whereas


© The Author(s) 2021 
A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 3-24, 2021.
induction corresponds to a least-fixed-point semantics (or initial algebras), coinduction
corresponds to a greatest-fixed-point semantics (or final coalgebras).
However, such an algebraic formulation does not account well for the way these 
principles are commonly used in deduction, where they are usually applied in dif- 
ferent ways: induction to prove properties of certain collections, and coinduction 
to show equivalences between processes and systems. 


Since the principle of induction is so well-known, induction methods are 
relatively well-developed. They are available in most (semi-)automated deduction 
systems, and tools for the formal verification of software and hardware such as 
theorem provers. Generally, implementations of the induction method employ 
one or more inference rules that express a general explicit induction scheme that 
holds for the elements being reasoned over. That is, to prove that some property, 
say P, holds for all elements in an inductively defined set, we (i) show that it 
holds for the initial elements, and (ii) show that P is preserved in the inductive 
generation of new elements. A side-effect of such implementations is that in 
applying inductive reasoning, the induction invariant must be provided explicitly. 
While advanced provers offer powerful facilities for producing and manipulating 
inductive goals, this still poses a major automation challenge. This formalization 
of the induction principle uses the classical notion of formal proofs invoked in 
standard theorem provers. There, proofs are well-founded trees, starting at the 
goal and reaching axioms while proceeding by applications of inference rules. 


A more robust and natural alternative formalization of inductive reasoning 
is implicit induction, which avoids the need for explicitly specifying induction 
invariants. This form of reasoning is enabled by extending the standard notion 
of well-founded, finite proof trees into non-well-founded proof trees, where the 
presence of cycles can be exploited instead of cluttering the proof with explicit 
inductive invariants. For example, to prove P(x) using implicit induction, one 
repeatedly decomposes the goal into subgoals that are either provable in the 
standard way (via well-founded subtrees) or reducible back to P(x). This alter- 
native has deep historic roots (originating in Fermat’s infinite-descent method) 
and recently has seen a flourishing of its proof theory via cyclic proof systems. 


Non-well-founded proof theory and its cyclic fragment (comprising only
finite and regular proofs) have been extremely successful in recent years in sup- 
porting implicit inductive reasoning. For one, the non-well-founded approach has 
been used to obtain (optimal) cut-free completeness results for highly expressive 
logics, such as the μ-calculus [3,35,34,37] and Kleene algebra [32,33], providing
further evidence of its utility for automation. Other works focus on the structural 
proof theory of non-well-founded systems, where these promote additional insights 
into standard proof-theoretical questions by separating local steps of deductive 
inference from global well-foundedness arguments. In particular, syntactic cut 
elimination for non-well-founded systems has been studied extensively in the 
linear logic settings [41,7]. Much work has been devoted to the formal study of
explicit versus implicit forms of induction in various logical settings including the 
μ-calculus, systems for arithmetics [74,31], and first-order logics with
inductive definitions [9,14,19]. The latter offers a system parameterized by a set
of inductive predicates with associated rules, rather than a single rule for induction
as with the others. The cyclic machinery has also been used to effectively
search for proofs of inductive properties and automatically verify properties of 
inductive programs, especially in the context of separation logic [78,68,16,17,18].


Unlike induction, the coinduction principle has not been so fully and nat- 
urally incorporated into major theorem provers, but it has gained importance 
and attention in recent years. As noted by Basold, Komendantskaya, and Li: 
“it may be surprising that automated proof search for coinductive predicates in 
first-order logic does not have a coherent and comprehensive theory, even after 
three decades...” [8]. Automated provers, to the best of our knowledge, cur- 
rently do not offer any support for coinduction, and while coinductive data types 
have been implemented in interactive theorem provers (a.k.a. proof assistants) 
such as Coq [11,47,83], Nuprl [30], Isabelle [13,81,12,38], Agda [1], Lean [4],
and Dafny [54], the treatment of these forms of data is often partial. These 
formalizations, as well as other formal frameworks that support the combination
of induction and coinduction, e.g., [80,61,6,46], generally rely on making
(co)invariants explicit within proofs. But just as inductive reasoning is naturally 
captured via proof cycles, cyclic systems seem to be particularly well-suited 
for also encompassing the implicit notion of coinduction. Nonetheless, while 
non-well-founded proof theory has been very successful in supporting inductive 
reasoning, this proof method has not been equally incorporated and explored 
in the context of coinductive reasoning. Some notable cyclic systems that do 
support coinduction in various settings include [67,58,72,36,2]. Another related
framework is that of Coq’s parameterized coinduction [47,83], which offers a
different, but highly related, implicit nature of proofs (based on patterns within 
parameters, rather than within proof sequents). 


This paper reviews the general method of non-well-founded proof theory, 
focusing on its use in capturing both implicit inductive and coinductive reasoning. 
Throughout the paper we focus on one very natural and simple logical framework 
to demonstrate the benefits of the approach—that of the transitive (co)closure 
logic. This logic offers a succinct and intuitive dual treatment to induction and 
coinduction, while still supporting their common practices in deduction, making 
it great for prototyping. More specifically, it has the benefits of (1) conciseness: no 
need for a separate language or interpretation for definitions, nor for fully general 
least/greatest-fixed-point operators; (2) intuitiveness: the concept of transitive
closure is basic, and the dual closure is equally simple to grasp, resulting in 
a simpler metatheory; (3) illumination: similarities, dualities, and differences 
between induction and coinduction are clearly demonstrated; and (4) naturality: 
local reasoning is rudimentary, and the global structure of proofs directly reflects 
higher-level reasoning. The framework presented is based on ongoing work by 
Reuben Rowe and the author, some of which can be found in [26,29,28,23]. We
conclude the paper by briefly discussing two major open research questions 
in the field of non-well-founded theory: namely, the need for a user-friendly 
implementation of the method into modern proof assistants, in order to make it 
applicable and to facilitate advancements in automated proof search and program
verification, and the task of determining the precise relationship between systems
for cyclic reasoning and standard systems for explicit reasoning. 


2 The Principles of Induction and Coinduction 


A duality between the induction principle and the coinduction principle is clearly 
observed when formulating them within an algebraic, or categorical, framework. 
This section reviews such a general algebraic formalization (Section 2.1), and
then presents transitive (co)closure logic, which will serve as our running example 
throughout this paper as it provides simple, yet very intuitive, inductive and 
coinductive notions (Section 2.2).


2.1 Algebraic Formalization of Induction and Coinduction 


Both the induction principle and the coinduction principle are usually defined 
algebraically via the concept of fixed points, where the definitions vary in different 
domains such as order theory, set theory or category theory. We opt here for 
a set-theoretical representation for the sake of simplicity, but more general 
representations, e.g., in a categorical setting, are also well-known [71]. 

Let $\Psi : \wp(D) \to \wp(D)$ be a monotone operator on sets for some fixed domain
$D$ (where $\wp(D)$ denotes the power set of $D$). Since $(\wp(D), \subseteq)$ is a complete lattice,
by the Knaster-Tarski theorem, both the least-fixed point and greatest-fixed
point of $\Psi$ exist. The least-fixed point, $\mu(\Psi)$, is given by the intersection of all its
prefixed points (that is, those sets $A$ satisfying $\Psi(A) \subseteq A$) and, dually, the
greatest-fixed point, $\nu(\Psi)$, is given by the union of all its postfixed points (that is,
those sets $A$ satisfying $A \subseteq \Psi(A)$). These definitions naturally yield corresponding
induction and coinduction principles.


Induction Principle: $\Psi(A) \subseteq A \implies \mu(\Psi) \subseteq A$
Coinduction Principle: $A \subseteq \Psi(A) \implies A \subseteq \nu(\Psi)$


The induction principle states that $\mu(\Psi)$ is contained in every $\Psi$-closed set, where
a set $A$ is called $\Psi$-closed if, for all $a \in A$ and $b \in D$, $(a,b) \in \Psi(A)$ implies
$b \in A$ (which means that $\mu(\Psi) = \bigcap\{A \mid \Psi(A) \subseteq A\}$). The coinduction principle
dually states that $\nu(\Psi)$ contains every $\Psi$-consistent set, where a set $A$ is called
$\Psi$-consistent if, for all $a \in A$, there is some $b \in D$ such that both $(a,b) \in \Psi(A)$
and $b \in A$ (which means that $\nu(\Psi) = \bigcup\{A \mid A \subseteq \Psi(A)\}$).
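The two principles can be checked mechanically on a finite domain. The following is a minimal sketch, not part of the original text: the step graph `SUCC`, the base set `BASE`, and the operator `psi` are hypothetical choices made only for illustration. It computes both fixed points by iteration (which suffices for a monotone operator on a finite power set) and then verifies the induction and coinduction principles by brute force over all subsets.

```python
from itertools import combinations

def lfp(op, domain):
    """Least fixed point of a monotone `op`: iterate upward from the empty set."""
    current = frozenset()
    while True:
        nxt = frozenset(op(current)) & domain
        if nxt == current:
            return current
        current = nxt

def gfp(op, domain):
    """Greatest fixed point of a monotone `op`: iterate downward from the full domain."""
    current = frozenset(domain)
    while True:
        nxt = frozenset(op(current)) & domain
        if nxt == current:
            return current
        current = nxt

# Hypothetical example data: 0 -> 1 -> 2, a self-loop on 3, 4 -> 3, and a dead end 5.
DOMAIN = frozenset(range(6))
SUCC = {0: [1], 1: [2], 2: [], 3: [3], 4: [3], 5: []}
BASE = frozenset({2})

def psi(a):
    """Monotone operator: the base element, plus every element with a successor in `a`."""
    return BASE | {x for x in DOMAIN if any(s in a for s in SUCC[x])}

def subsets(domain):
    """Enumerate all subsets of a (small) finite domain."""
    items = sorted(domain)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

MU = lfp(psi, DOMAIN)  # elements that reach the base element in finitely many steps
NU = gfp(psi, DOMAIN)  # additionally, elements that can step forever

# Induction principle: every prefixed point contains MU.
induction_ok = all(MU <= a for a in subsets(DOMAIN) if psi(a) <= a)
# Coinduction principle: every postfixed point is contained in NU.
coinduction_ok = all(a <= NU for a in subsets(DOMAIN) if a <= psi(a))
```

In this toy instance the least fixed point collects the elements that reach the base element after finitely many steps, while the greatest fixed point also admits elements that can keep stepping forever, mirroring the finite versus possibly infinite reading of inductive and coinductive sets.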

The intuition behind an inductively defined set is that of a “bottom-up”
construction. That is, one starts with a set of initial elements and then applies 
the constructor operators finitely many times. One concrete example of an 
inductively defined set is that of finite lists, which can be constructed starting 
from the empty list and one constructor operator that adds an element to the 
head of the list. The finiteness restriction stems from the fact that induction 
is the smallest subset that can be constructed using the operators. Using the 
induction principle, one can show that all elements of an inductively defined set 
satisfy a certain property, by showing that the property is preserved for each
constructor operator. A coinductively defined set is also constructed by starting
with a set of initial elements and applying the constructor operators, possibly 
infinitely many times. One example, which arises from the same initial element 
and constructors as the inductive set of lists, is that of possibly infinite lists, 
i.e. the set that also contains infinite streams. The fact that we can apply the 
operators infinitely many times is due to coinduction being the largest subset 
that can (potentially) be constructed using the operators. Using the coinduction 
principle, one can show that an element is in a coinductively defined set. 


2.2 Transitive (Co)closure Operators 


Throughout the paper we will use two instances of fixed points that provide a 
minimal framework which captures applicable forms of inductive and coinductive 
reasoning in an intuitive manner, and is more amenable for automation than 
the full theory of fixed points. This section introduces these fixed points and 
discusses the logical framework obtained by adding them to first-order logic. 


Definition 1 ((Post-)Composition Operator). Given a binary relation, $X$,
$\Psi_X$ is an operator on binary relations that post-composes its input with $X$, that
is, $\Psi_X(R) = X \cup (X \circ R) = \{(a,c) \mid (a,c) \in X \lor \exists b.\, (a,b) \in X \land (b,c) \in R\}$.


Because unions and compositions are monotone operators over a complete
lattice, so are composition operators, and therefore both $\mu(\Psi_X)$ and $\nu(\Psi_X)$ exist.
A pair of elements, $(a,b)$, is in $\mu(\Psi_X)$ when $b$ is in every $X$-closed set that can be
reached by some $X$-steps from $a$, which is equivalent to saying that there is a finite
(non-empty) chain of $X$ steps from $a$ to $b$. A pair of elements, $(a,b)$, is in $\nu(\Psi_X)$
when there exists a set $A$ that contains $a$ such that the set $A \setminus \{b\}$ is $X$-consistent,
which is equivalent to saying that either there is a finite (non-empty) chain of $X$
steps from $a$ to $b$, or there is an infinite chain of $X$ steps starting from $a$.
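The characterization above can be observed directly on a small finite relation. The sketch below is an illustration under assumed data (the relation `X` and the five-element domain are hypothetical): it implements the post-composition operator of Definition 1 and obtains the closure by iterating from the empty relation and the co-closure by iterating from the full relation.

```python
def post_compose(x, r):
    """Psi_X(R) = X ∪ (X ∘ R): one X-step, optionally followed by an R-step."""
    return frozenset(x) | {(a, c) for (a, b) in x for (b2, c) in r if b == b2}

def transitive_closure(x):
    """mu(Psi_X): iterate Psi_X upward from the empty relation."""
    r = frozenset()
    while True:
        nxt = post_compose(x, r)
        if nxt == r:
            return r
        r = nxt

def transitive_co_closure(x, domain):
    """nu(Psi_X): iterate Psi_X downward from the full relation on `domain`."""
    r = frozenset((a, b) for a in domain for b in domain)
    while True:
        nxt = post_compose(x, r)
        if nxt == r:
            return r
        r = nxt

# Hypothetical example: a finite chain 0 -> 1 -> 2 and a self-loop on 3; 4 is isolated.
DOMAIN = range(5)
X = frozenset({(0, 1), (1, 2), (3, 3)})

MU = transitive_closure(X)
NU = transitive_co_closure(X, DOMAIN)
```

Here `MU` contains exactly the pairs joined by a finite non-empty chain, while `NU` additionally relates 3 to every element of the domain, because an infinite chain of `X` steps starts at 3.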

The $\mu(\Psi_X)$ operator is in fact the standard transitive closure operator. Extending
first-order logic (FOL) with the addition of this transitive closure operator
results in the well-known transitive closure logic (a.k.a. ancestral logic), a generic,
minimal logic for expressing finitary¹ inductive structures [48,73,5,24,25,23].
Transitive closure (TC) logic was recently extended with a dual operator, called
transitive co-closure, that corresponds to $\nu(\Psi_X)$ [27]. The definition below presents
the syntax and semantics of the extended logic, called Transitive (co)Closure
logic, or TcC logic.


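Definition 2 (TcC Logic). For $\sigma$ a first-order signature, let $s$, $t$ and $P$ range
over terms and predicate symbols over $\sigma$ (respectively), and let $M$ be a structure
for $\sigma$, and $v$ a valuation in $M$.


Syntax. The language $\mathcal{L}_{\mathsf{TcC}}$ (over $\sigma$) is given by the following grammar:


$\varphi, \psi ::= s = t \mid P(t_1, \ldots, t_n) \mid \neg\varphi \mid \varphi \land \psi \mid \varphi \lor \psi \mid \varphi \to \psi \mid \forall x.\varphi \mid \exists x.\varphi \mid (\mathit{TC}_{x,y}\,\varphi)(s,t) \mid (\mathit{TC}^{\mathsf{op}}_{x,y}\,\varphi)(s,t)$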


1 See [40] for a formal definition of “finitary” inductive definitions.
where the variables $x, y$ in the formulas $(\mathit{TC}_{x,y}\,\varphi)(s,t)$ and $(\mathit{TC}^{\mathsf{op}}_{x,y}\,\varphi)(s,t)$
are distinct and are bound in the subformula $\varphi$.


Semantics. The satisfaction relation $M, v \models \varphi$ extends the standard satisfaction
relation of classical first-order logic with the following clauses:


$M, v \models (\mathit{TC}_{x,y}\,\varphi)(s,t) \iff \exists (d_i)_{i \le n} .\; d_1 = v(s) \land d_n = v(t) \land \forall i < n.\; M, v[x := d_i, y := d_{i+1}] \models \varphi$


$M, v \models (\mathit{TC}^{\mathsf{op}}_{x,y}\,\varphi)(s,t) \iff \exists (d_i)_{i > 0} .\; d_1 = v(s) \land \forall i > 0.\; d_i = v(t) \lor M, v[x := d_i, y := d_{i+1}] \models \varphi$


where $v[x_1 := d_1, \ldots, x_n := d_n]$ denotes the valuation that maps $x_i$ to $d_i$ and
behaves as $v$ otherwise; $\varphi\{t_1/x_1, \ldots, t_n/x_n\}$ denotes simultaneous substitution;
and $(d_i)_{i \le n}$ and $(d_i)_{i > 0}$ denote, respectively, non-empty finite and (countably)
infinite sequences of elements from the domain.


Intuitively, the formula $(\mathit{TC}_{x,y}\,\varphi)(s,t)$ asserts that there is a (possibly empty)
finite $\varphi$-path from $s$ to $t$, while the formula $(\mathit{TC}^{\mathsf{op}}_{x,y}\,\varphi)(s,t)$ asserts that either
there is a (possibly empty) finite $\varphi$-path from $s$ to $t$, or an infinite $\varphi$-path starting
at $s$. For simplicity of presentation we take here the reflexive forms of the closure
operators, which yields the following correspondence.²


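Proposition 1. Let $[\![\varphi]\!]^{M,v}_{x,y} := \{(a,b) \mid M, v[x := a, y := b] \models \varphi\}$.


(i) $M, v \models (\mathit{TC}_{x,y}\,\varphi)(s,t) \iff v(s) = v(t)$ or $(v(s), v(t)) \in \mu(\Psi_{[\![\varphi]\!]^{M,v}_{x,y}})$
(ii) $M, v \models (\mathit{TC}^{\mathsf{op}}_{x,y}\,\varphi)(s,t) \iff v(s) = v(t)$ or $(v(s), v(t)) \in \nu(\Psi_{[\![\varphi]\!]^{M,v}_{x,y}})$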
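As a worked instance of this correspondence, consider the following illustration. It is not taken from the source and rests on assumed choices: the structure $M$ is the standard natural numbers with zero and a successor function $\mathsf{s}$, and $\varphi$ is the one-step successor relation.

```latex
% Assumptions (hypothetical example): M = (\mathbb{N}, 0, \mathsf{s}) and \varphi := (y = \mathsf{s}(x)).
\begin{align*}
[\![\varphi]\!]^{M,v}_{x,y} &= \{(n, n+1) \mid n \in \mathbb{N}\}, \\
\mu\big(\Psi_{[\![\varphi]\!]^{M,v}_{x,y}}\big) &= \{(n, m) \mid n < m\}, \\
\nu\big(\Psi_{[\![\varphi]\!]^{M,v}_{x,y}}\big) &= \mathbb{N} \times \mathbb{N}.
\end{align*}
```

In this structure both $(\mathit{TC}_{x,y}\,\varphi)(0,t)$ and $(\mathit{TC}^{\mathsf{op}}_{x,y}\,\varphi)(0,t)$ hold for every $t$, but for different reasons: the former needs a finite successor path from $0$ to $v(t)$, whereas for the latter the infinite chain $0, 1, 2, \ldots$ already suffices, whatever $v(t)$ is.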


Note that, unlike the situation in standard fixed-point logics, the two closure 
operators are not inter-definable. The TC operator is definable in arithmetics 
(i.e. in Peano Arithmetics, PA), but the $\mathit{TC}^{\mathsf{op}}$ operator is not.

Thus, TcC logic is subsumed by fixed-point logics, such as the first-order 
μ-calculus [64], but the concept of the transitive (co)closure is intuitively simpler
than that of general fixed-point operators, and it does not require any syntactic
restrictions to ensure monotonicity. In fact, due to its complexity and generality,
the investigation of the full first-order μ-calculus tends to focus only on variants
and fragments, and is mainly concentrated on the logical and model-theoretic
aspects, lacking a comprehensive proof theory.³ Another reason for focusing on
these (co)closure operators is that they allow for the embedment of many forms of 
inductive and coinductive reasoning within one concise logical framework. Thus, 
while other extensions of FOL with inductive definitions are a priori parametrized 
by a set of inductive definitions [59/60/79]19], bespoke induction principles do 


2 The definition of the post-composition operator can be reformulated to incorporate
the reflexive case; however, we opt to keep the more standard definition.

3 Proof theory has been developed for the propositional modal μ-calculus fragment [3],
and recently also for matching μ-logic [20,21,22], which generalizes the μ-calculus.
not need to be added to TcC logic; instead, applicable (co)induction schemes are
available within a single, unified language. This conciseness allows the logic to 
be formally captured using one fixed set of inference rules, and thus makes it 
particularly amenable for automation. Moreover, in TcC logic, the same signature 
is shared for both inductive and coinductive data, making certain aspects of the 
relationship between the two principles more apparent. 

Defining infinite structures via the coclosure operators in TcC logic leads to a 
symmetric foundation for functional languages where inductive and coinductive 
data types can be naturally mixed. For example, using the standard list con- 
structors (the constant nil and the (infix) binary function symbol ‘::’) and their 
axiomatization, the collections of finite lists, possibly infinite lists, and infinite 
lists (i.e., streams) are straightforwardly definable as follows. 


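$\mathsf{List}(\sigma) := (\mathit{TC}_{x,y}\; \exists a.\, x = a :: y)(\sigma, \mathsf{nil})$
$\mathsf{List}^{\infty}(\sigma) := (\mathit{TC}^{\mathsf{op}}_{x,y}\; \exists a.\, x = a :: y)(\sigma, \mathsf{nil})$
$\mathsf{Stream}(\sigma) := (\mathit{TC}^{\mathsf{op}}_{x,y}\; \exists a.\, x = a :: y \land y \neq \mathsf{nil})(\sigma, \mathsf{nil}) \land \sigma \neq \mathsf{nil}$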


TcC logic also naturally captures properties of, and functions on, streams [29]. 
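To make the three list predicates concrete, the following sketch evaluates them over a small finite pointer structure. This is an illustration under stated assumptions rather than anything from the source: cells are hypothetical names, the dictionary `HEAP` maps each cons cell to its tail (the head element is ignored, since the defining formulas only quantify it away), and a cyclic chain of cells stands in for an infinite stream.

```python
NIL = "nil"  # the empty-list constant

def walk(tail, start):
    """Follow tail pointers from `start`; report how the walk ends.

    Returns "nil" if nil is reached, "cycle" if a cell repeats (an infinite
    path), and "stuck" if a non-nil object without a tail is reached.
    """
    seen = set()
    cell = start
    while True:
        if cell == NIL:
            return "nil"
        if cell in seen:
            return "cycle"
        if cell not in tail:
            return "stuck"
        seen.add(cell)
        cell = tail[cell]

def is_list(tail, cell):
    """List: a (possibly empty) finite tail-path leads to nil."""
    return walk(tail, cell) == "nil"

def is_colist(tail, cell):
    """Possibly infinite list: a finite tail-path to nil, or an infinite tail-path."""
    return walk(tail, cell) in ("nil", "cycle")

def is_stream(tail, cell):
    """Stream: an infinite tail-path that never meets nil."""
    return walk(tail, cell) == "cycle"

# Hypothetical heap: a -> b -> nil is finite, s -> t -> s is cyclic,
# and u points at an object that is neither nil nor a cons cell.
HEAP = {"a": "b", "b": NIL, "s": "t", "t": "s", "u": "atom"}
```

With this heap, `is_list(HEAP, "a")` and `is_colist(HEAP, "s")` hold, `is_stream(HEAP, "s")` holds while `is_stream(HEAP, "a")` does not, and `u` satisfies none of the three predicates.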


3 Non-well-founded Deduction for Induction 


This section presents the general method of non-well-founded proof theory (Section
3.1), and then provides a concrete example of a non-well-founded proof
system for inductive reasoning in the setting of the transitive closure (Section 3.2),
where the implicit form of inductive reasoning is then compared against the 
explicit one. Note that this section first presents the proof theory only for TC 
logic, which is the inductive fragment of TcC logic, i.e., the one based only on 
the transitive closure operator. 


3.1 Non-well-founded Proof Theory 


The method of non-well-founded proofs provides an alternative approach to 
explicit inductive reasoning by exploiting the fact that there are no infinite 
descending chains of elements of well-ordered sets. Clearly, not all non-well- 
founded proof trees constitute a valid proof, i.e. a proof of the validity of the 
conclusion in the root. A proof tree that simply has one loop over the conclusion 
or one that repeatedly uses the substitution or permutation rules to obtain cycles 
are examples of non-well-founded proof trees that one would not like to consider 
as valid. Thus, a non-well-founded proof tree is allowed to be infinite, but to 
be considered as a valid proof, it has to obey an additional requirement that 
prevents such unsound deductions. Hence, non-well-founded proofs are subject to 
the restriction that every infinite path in the proof admits some infinite descent. 
Intuitively, the descent is witnessed by tracing syntactic elements, terms or 
formulas, for which we can give a correspondence with elements of a well-founded 
set. In this respect, non-well-founded proof theory enables a separation between
local steps of deductive inference and global well-foundedness arguments, which
are encoded in traces of terms or formulas through possibly infinite derivations. 

Below we present proof systems in the style of sequent calculus. Sequents are 
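expressions of the form $\Gamma \Rightarrow \Delta$, for finite sets of formulas $\Gamma$ and $\Delta$. We write $\Gamma, \varphi$
as a shorthand for $\Gamma \cup \{\varphi\}$, and $\mathsf{fv}(\Gamma)$ for the set of free variables of the formulas
in $\Gamma$. A sequent $\Gamma \Rightarrow \Delta$ is valid if and only if the formula $\bigwedge_{\varphi \in \Gamma} \varphi \to \bigvee_{\psi \in \Delta} \psi$ is.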

Let S be a collection of inference rules. First, we define the notion of a 
non-well-founded proof tree, a pre-proof, based on S. 


Definition 3 (Pre-proofs). A pre-proof in $S$ is a possibly infinite derivation
tree formed using the inference rules of $S$. A path in a pre-proof is a possibly
infinite sequence of sequents, $s_0, s_1, \ldots(, s_n)$, such that $s_0$ is the root sequent of
the proof, and $s_{i+1}$ is a premise of $s_i$ in the derivation tree for each $i < n$.


As mentioned, not every pre-proof is a proof: only those in which there is some 
notion of infinite descent in every infinite branch, which allows one to formalize 
inductive arguments. To make this concrete, one picks some syntactic element, 
which can be formulas or terms, to be tracked through a pre-proof. We call such 
elements traced elements. The intuition behind picking the traced elements is that 
eventually, when we are given a pre-proof, we could trace these elements through 
the infinite branches, and map them into some well-founded set. This is what 
underpins the soundness of the non-well-founded method, as explained below. 
Given certain traced elements, we inductively define a notion of trace pairs which 
corresponds to the appearances of such traced elements within applications of 
the inference rules throughout the proof. That is, for traced elements, $\tau, \tau'$, and a
rule with conclusion $s$ and a premise $s'$ such that $\tau$ appears in $s$ and $\tau'$ appears
in $s'$, $(\tau, \tau')$ is said to be a trace pair for $(s, s')$ for certain rule applications,
and there has to be at least one case identified as a progressing trace pair. The
progression intuitively stands for the cases in which the elements of the trace pair
are mapped to strictly decreasing elements of the well-founded set. We provide a
concrete example of traced elements and a trace pair definition in the transitive
closure setting in Section 3.2.


Definition 4 (Traces). A trace is a (possibly infinite) sequence of traced elements.
We say that a trace $\tau_1, \tau_2, \ldots(, \tau_n)$ follows a path $s_1, s_2, \ldots(, s_m)$ in a
pre-proof $P$ if, for some $k \ge 0$, each consecutive pair of formulas $(\tau_i, \tau_{i+1})$ is a
trace pair for $(s_{i+k}, s_{i+k+1})$. If $(\tau_i, \tau_{i+1})$ is a progressing pair, then we say that
the trace progresses at $i$, and we say that the trace is infinitely progressing if it
progresses at infinitely many points.


Proofs, then, are pre-proofs which satisfy a global trace condition. 


Definition 5 (Infinite Proofs). A proof is a pre-proof in which every infinite 
path is followed by some infinitely progressing trace. 
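For a single cycle, the global trace condition of Definitions 4 and 5 can be checked by a simple graph search. The following is a minimal sketch, not the procedure used by any actual prover: it assumes one cyclic path with finitely many traced elements, and the function name and the tuple encoding of trace pairs are hypothetical choices made for illustration. Because the graph of (position, element) pairs is finite, an infinitely progressing trace exists exactly when some progressing trace pair lies on a cycle of that graph.

```python
from collections import defaultdict


def has_infinitely_progressing_trace(cycle_len, trace_pairs):
    """Check the trace condition on the cyclic path s_0 -> ... -> s_{n-1} -> s_0.

    trace_pairs is an iterable of tuples (i, tau, tau_next, progressing),
    read as: (tau, tau_next) is a trace pair for (s_i, s_{(i+1) % n}).
    This encoding is an assumption of the sketch.
    """
    succ = defaultdict(set)
    progress_edges = []
    for i, tau, tau_next, progressing in trace_pairs:
        src = (i, tau)
        dst = ((i + 1) % cycle_len, tau_next)
        succ[src].add(dst)
        if progressing:
            progress_edges.append((src, dst))

    def reaches(start, goal):
        # Plain depth-first search over the finite trace graph.
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            for nxt in succ.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    # A trace can progress infinitely often iff it can return to the
    # source of some progressing edge after taking that edge.
    return any(reaches(dst, src) for src, dst in progress_edges)


# Two sequents on the cycle; the traced element progresses on the first step.
pairs = [(0, "A", "B", True), (1, "B", "A", False)]
print(has_infinitely_progressing_trace(2, pairs))
```

Proofs with several overlapping cycles have infinitely many infinite paths, so this per-cycle check is not sufficient in general; it only illustrates what the condition asks for.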


We denote by $S^{\infty}$ the non-well-founded proof system based on the rules in $S$.
The general soundness argument for such infinite systems follows from a
combination of standard local soundness of the inference rules in $S$ together
with a global soundness argument via an infinite descent-style construction, due
to the presence of infinitely progressing traces for each infinite path in a proof.
One assumes for contradiction that the conclusion of the proof is invalid, which,
by the local soundness of the rules, entails the existence of an infinite sequence
of counter-models, going along an infinite branch. Then, one demonstrates a
mapping of these models into a well-founded set, $(D, <)$, which decreases while
following the sequence of counter-models, and strictly decreases when going
over progression points. But then, by the global trace condition, there exists an
infinitely descending chain in $D$, which of course yields a contradiction.


$$(TC_{ref})\;\; \dfrac{}{\Gamma \Rightarrow \Delta, (TC_{x,y}\,\varphi)(s,s)}$$

$$(TC_R)\;\; \dfrac{\Gamma \Rightarrow \Delta, \varphi\{s/x, r/y\} \qquad \Gamma \Rightarrow \Delta, (TC_{x,y}\,\varphi)(r,t)}{\Gamma \Rightarrow \Delta, (TC_{x,y}\,\varphi)(s,t)}$$

$$(TC_L^{im})\;\; \dfrac{\Gamma, s = t \Rightarrow \Delta \qquad \Gamma, \varphi\{s/x, z/y\}, (TC_{x,y}\,\varphi)(z,t) \Rightarrow \Delta}{\Gamma, (TC_{x,y}\,\varphi)(s,t) \Rightarrow \Delta}$$

$$(TC_L^{ex})\;\; \dfrac{\Gamma, \psi(x), \varphi(x,y) \Rightarrow \Delta, \psi\{y/x\}}{\Gamma, \psi\{s/x\}, (TC_{x,y}\,\varphi)(s,t) \Rightarrow \Delta, \psi\{t/x\}}$$

where in $(TC_L^{im})$, $z \notin fv(\Gamma, \Delta, (TC_{x,y}\,\varphi)(s,t))$, and in $(TC_L^{ex})$, $x \notin fv(\Gamma, \Delta)$ and $y \notin fv(\Gamma, \Delta, \psi)$.

Fig. 1: Proof rules for the TC operator

While a full infinitary proof system is clearly not effective, effectiveness can 
be obtained by restricting consideration to the cyclic proofs, i.e., those that are 
finitely representable. These are the regular infinite proof trees, which contain 
only finitely many distinct subtrees. Intuitively, the cycles in the proofs capture 
the looping nature of inductive arguments and, thereby, the cyclic framework 
provides the basis for an effective system for automated inductive reasoning. A 
possible way of formalizing such proof graphs is as standard proof trees containing 
open nodes, called buds, to each of which is assigned a syntactically equal internal 
node of the proof, called a companion (see, e.g., [19, Sec. 7] for a formal definition).


Definition 6 (Cyclic Proofs). The cyclic proof system $S^{\omega}$ is the subsystem
of $S^{\infty}$ comprising all and only the finite and regular infinite proofs (i.e., those
proofs that can be represented as finite, possibly cyclic, graphs).


3.2 Explicit vs. Implicit Induction in Transitive Closure Logic 


Since we focus on the formal treatment of induction in this section, we here 
present the proof systems for TC logic, i.e., the logic comprising only the TC 
operator extension. Both proof systems presented are extensions of $LK_=$, the
sequent calculus for classical first-order logic with equality [44].

Figure 1 presents proof rules for the TC operator. Rules $(TC_{ref})$ and $(TC_R)$
assert the reflexivity and the transitivity of the TC operator, respectively. Rule
$(TC_L^{ex})$ can be intuitively read as follows: if the extension of $\psi$ is $\varphi$-closed, then
it is also closed under the reflexive transitive closure of $\varphi$. Rule $(TC_L^{im})$ is in
a sense a case-unfolding argument, stating that to prove something about the
reflexive transitive closure of $\varphi$, one must prove it for the base case (i.e., $s = t$)
and also prove it for one arbitrary decomposition step (i.e., where the $\varphi$-path is
decomposed to the first step and the remaining path).


4 Here $LK_=$ includes a substitution rule, which was not a part of the original systems.

The explicit (well-founded) proof system $S_{TC}$ is based on rules $(TC_{ref})$, $(TC_R)$
and $(TC_L^{ex})$. The implicit (non-well-founded) proof system $S^{\infty}_{TC}$ is based on rules
$(TC_{ref})$, $(TC_R)$ and $(TC_L^{im})$, and its cyclic subsystem is denoted by $S^{\omega}_{TC}$. In $S^{\infty}_{TC}$,
the traced elements are TC formulas on the left-hand side of the sequents, and
the points of progression are highlighted in blue in Figure 1. The soundness of
the $S^{\infty}_{TC}$ system is then underpinned by mapping each model of a TC formula
of the form $(TC_{x,y}\,\varphi)(s,t)$ to the minimal length of the $\varphi$-path between $s$ and $t$.
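The measure used in this soundness argument can be made concrete on a finite structure. The sketch below is an illustration only: it assumes the relation denoted by the formula is given as a finite set of pairs, and the function name `min_path_length` is hypothetical. It computes the length of a shortest path by breadth-first search, with the empty path covering the reflexive case.

```python
from collections import deque


def min_path_length(phi, s, t):
    """Length of a shortest phi-path from s to t, or None if there is none.

    phi is a finite set of pairs (a, b); the empty path witnesses s == t.
    """
    dist = {s: 0}
    queue = deque([s])
    while queue:
        a = queue.popleft()
        if a == t:
            return dist[a]
        for (x, y) in phi:
            if x == a and y not in dist:
                dist[y] = dist[a] + 1
                queue.append(y)
    return None


phi = {(0, 1), (1, 2), (2, 3), (0, 2)}
n = min_path_length(phi, 0, 3)
print(n)
# Unfolding the first step of a shortest path leaves a strictly shorter one,
# which is the decrease that a progression point is meant to record.
print(any(min_path_length(phi, z, 3) == n - 1 for (a, z) in phi if a == 0))
```

The second check mirrors the case-unfolding rule: whenever $s \neq t$, some first step $z$ exists from which the measure is smaller by one.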

Rules $(TC_L^{ex})$ and $(TC_L^{im})$ both offer a unified treatment of inductive reasoning,
in the sense that bespoke induction principles do not need to be added to
the systems. A big advantage of the implicit system is that it can ameliorate
the major challenge in automating inductive reasoning of finding the induction
invariant a priori. Indeed, a major difference between these two induction rules
is the presence of the induction invariant. In $(TC_L^{ex})$, unlike in $(TC_L^{im})$, there is
an explicit appearance of the induction invariant, namely $\psi$. Instead, in $S^{\infty}_{TC}$, the
induction invariant, which is often stronger than the goal one is attempting to
prove, can (usually) be inferred via the cycles in the proof.

Since TC logic subsumes arithmetics, by Gödel's result, the system $S_{TC}$, while
sound, is incomplete with respect to the standard semantics.⁵ Nonetheless, the
full non-well-founded proof system $S^{\infty}_{TC}$ is sound and (cut-free) complete for
TC logic [28,26]. Furthermore, the cyclic subsystem $S^{\omega}_{TC}$ subsumes the explicit
system $S_{TC}$.


4 Adding Coinductive Reasoning 


This section extends the non-well-founded proof theory of TC logic from
Section 3.2 to support the transitive coclosure operator, and thus the full TcC logic
(Section 4.1). We then provide an illustrative example of the use of the resulting
framework, demonstrating its potential for automated proof search (Section 4.2).


4.1 Implicit Coinduction in Transitive (Co)closure Logic 


The implicit (non-well-founded) proof system for TcC logic, denoted $S^{\infty}_{TcC}$, is an
extension of the system $S^{\infty}_{TC}$, obtained by the addition of the proof rules for the
$TC^{op}$ operator presented in Figure 2. Again, rules $(TC^{op}_{ref})$ and $(TC^{op}_R)$ state the
reflexivity and transitivity of the $TC^{op}$ operator, respectively, and rule $(TC^{op}_L)$
is a case-unfolding argument. However, unlike the case for the TC operator, in
which rule $(TC_L^{im})$ can be replaced by a rule that decomposes the path from the
end, in rule $(TC^{op}_L)$ it is critical that the decomposition starts at the first step
(as there is no end point). Apart from the additional inference rules, $S^{\infty}_{TcC}$ also
extends the traced elements to include $TC^{op}$ formulas, which are traced on the
right-hand side of the sequents, and the points of progression are highlighted in
pink in Figure 2.


$$(TC^{op}_{ref})\;\; \dfrac{}{\Gamma \Rightarrow \Delta, (TC^{op}_{x,y}\,\varphi)(s,s)}$$

$$(TC^{op}_R)\;\; \dfrac{\Gamma \Rightarrow \Delta, \varphi\{s/x, r/y\} \qquad \Gamma \Rightarrow \Delta, (TC^{op}_{x,y}\,\varphi)(r,t)}{\Gamma \Rightarrow \Delta, (TC^{op}_{x,y}\,\varphi)(s,t)}$$

$$(TC^{op}_L)\;\; \dfrac{\Gamma, s = t \Rightarrow \Delta \qquad \Gamma, \varphi\{s/x, z/y\}, (TC^{op}_{x,y}\,\varphi)(z,t) \Rightarrow \Delta}{\Gamma, (TC^{op}_{x,y}\,\varphi)(s,t) \Rightarrow \Delta}$$

where in $(TC^{op}_L)$, $z \notin fv(\Gamma, \Delta, (TC^{op}_{x,y}\,\varphi)(s,t))$.

Fig. 2: Proof rules for the $TC^{op}$ operator


5 $S_{TC}$ is sound and complete with respect to a generalized form of Henkin semantics [23].

Interestingly, the two closure operators are captured proof-theoretically using 
inference rules with the exact same structure. The difference proceeds from 
the way the decomposition of the corresponding formulas is traced in a proof 
derivation: for induction, TC formulas are traced on the left-hand sides of the
sequents; for coinduction, $TC^{op}$ formulas are traced on the right-hand sides of
sequents. Thus, traces of TC formulas show that certain infinite paths cannot
exist (induction is well-founded), while traces of $TC^{op}$ formulas show that other
infinite paths must exist (coinduction is productive). This formation of the rules
for the (co)closure operators is extremely useful with respect to automation, as 
the rules are locally uniform, thus enabling the same treatment for induction 
and coinduction, but are also globally dual, ensuring that the underlying system 
handles them appropriately (at the limit). Also, just like the case for induction, 
the coinduction invariant is not explicitly mentioned in the inference rules. 

The full non-well-founded system $S^{\infty}_{TcC}$ is sound and (cut-free) complete with
respect to the semantics of TcC logic [27]. It has been shown to be powerful enough
to capture non-trivial examples of mixed inductive and coinductive reasoning
(such as the transitivity of the substream relation), and to provide a smooth
integration of induction and coinduction while also highlighting their similarities.
To exemplify the naturality of the system, Figure 3 demonstrates a proof that
the transitive closure is contained within the transitive co-closure. The proof has
a single cycle (and thus a single infinite path), but, following this path, there is
both a trace, consisting of the TC formulas highlighted in blue, and a co-trace,
consisting of the $TC^{op}$ formulas highlighted in pink (the progression points are
marked with boxes). Thus, the proof can be seen both as a proof by induction
and as a proof by coinduction.
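The containment proved in Figure 3 can also be observed semantically on a small finite structure. The sketch below rests on an assumption that is not spelled out in this section: that TC is read as the least, and the co-closure as the greatest, fixed point of the same one-step operator ("equal, or one step followed by a related pair"). All function names are hypothetical.

```python
def closure_step(phi, domain, X):
    """One application of the operator shared by both closures:
    keep (a, b) if a == b, or if phi(a, c) holds and (c, b) is in X."""
    return {(a, b) for a in domain for b in domain
            if a == b or any((a, c) in phi and (c, b) in X for c in domain)}


def fixed_point(phi, domain, start):
    # Iterate the monotone operator until it stabilizes (finite domain).
    X = start
    while True:
        nxt = closure_step(phi, domain, X)
        if nxt == X:
            return X
        X = nxt


def tc(phi, domain):
    # Least fixed point: iterate upwards from the empty relation.
    return fixed_point(phi, domain, set())


def tc_op(phi, domain):
    # Greatest fixed point: iterate downwards from the full relation.
    return fixed_point(phi, domain, {(a, b) for a in domain for b in domain})


domain = {0, 1, 2}
phi = {(0, 1), (1, 0)}          # a two-element loop; 2 is isolated
print(tc(phi, domain) <= tc_op(phi, domain))
print((0, 2) in tc_op(phi, domain), (0, 2) in tc(phi, domain))
```

In this example the pair (0, 2) belongs to the co-closure only: the path from 0 can be extended forever without ever reaching 2, which is the "no end point" situation the left rule for the co-closure has to account for.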


4.2 Applications in Automated Proof Search 


The cyclic reasoning method seems to have enormous potential for the automation
of (co)inductive reasoning, which has not been fully realized. Most notably, as
mentioned, cyclic systems can facilitate the discovery of a (co)induction invariant,
which is a primary challenge for mechanized (co)inductive reasoning.⁶ Thus, in
implicit systems, the (co)inductive arguments and hypotheses may be encoded
in the cycles of a proof, in the sense that when developing the proof, one can
start with the goal and incrementally adjust the invariant as many times as
necessary. Roughly speaking, one can perform lazy unfolding of the (co)closure
operators to a point in which a cycle can be obtained, taking advantage of
non-local information retrieved in other branches of the proof.


(1) $(TC_{x,y}\,\varphi)(u,v) \Rightarrow (TC^{op}_{x,y}\,\varphi)(u,v)$, the root, by $(TC_L^{im})$ from (2) and (4)
(2) $u = v \Rightarrow (TC^{op}_{x,y}\,\varphi)(u,v)$, by (Eq) from (3)
(3) $\Rightarrow (TC^{op}_{x,y}\,\varphi)(u,u)$, by $(TC^{op}_{ref})$
(4) $\varphi\{u/x, w/y\}, (TC_{x,y}\,\varphi)(w,v) \Rightarrow (TC^{op}_{x,y}\,\varphi)(u,v)$, by $(TC^{op}_R)$ from (5) and (6)
(5) $\varphi\{u/x, w/y\}, (TC_{x,y}\,\varphi)(w,v) \Rightarrow \varphi\{u/x, w/y\}$, by (Ax)
(6) $\varphi\{u/x, w/y\}, (TC_{x,y}\,\varphi)(w,v) \Rightarrow (TC^{op}_{x,y}\,\varphi)(w,v)$, by (Wk) from (7)
(7) $(TC_{x,y}\,\varphi)(w,v) \Rightarrow (TC^{op}_{x,y}\,\varphi)(w,v)$, by (Sub) from (8)
(8) $(TC_{x,y}\,\varphi)(u,v) \Rightarrow (TC^{op}_{x,y}\,\varphi)(u,v)$, a bud whose companion is the root (1)

Fig. 3: Proof that the $TC^{op}$ operator subsumes the TC operator

The implications of these phenomena for proof search can be examined using 
proof-theoretic machinery to analyze and manipulate the structures of cyclic 
proofs. For example, when verifying properties of mutually defined relations, 
the associated explicit (co)induction principles are often extremely complex. In 
the cyclic framework, such complex explicit schemes generally correspond to 
overlapping cycles. Exploring such connections between hard problems that arise
from explicit invariants and the corresponding structure of cyclic proofs can
facilitate automated proof search. The cyclic framework offers yet another benefit
for verification in that it enables the separation of the two critical properties 
of a program, namely liveness (termination) and safety (correctness). Thus, 
while proving a safety property (validity of a formula), one can extract liveness 
arguments via infinite descent. 


4.2.1 Program Equivalence in the TcC Framework 


The use of the (co)closure operators in the TcC framework seems to be particularly
well-suited for formal verification, as these operators can be used to simultaneously
express the operational semantics of programs and the structure of the (co)data
manipulated by them. Use of the same constructors for both features of the
program constitutes an improvement over current formal frameworks, which
usually employ qualitatively different formalisms to describe the operational
semantics of programs and the associated data.⁷ For instance, although many
formalisms employ separation logic to describe the data structures manipulated
by programs (e.g., the Cyclist prover [18]), they also encode the relationships
between the program's memory and its operational behavior via bespoke
symbolic-execution inference rules [10165].


    rest := fix rest(f). λn. if n > 0 then (output n; rest f (n - 1)) else f 0
    f := fix f(n). let v = (output n; input()) * 2 in (if v ≠ 0 then f else rest f) (v + n)
    g := fix g(m). output (2 * m); let v = input() in if v = 0 then rest g (2 * m) else g (v + m)

$$RES(n, s, s') := (TC_{(u_1,u_2),(v_1,v_2)}\, (u_1 > 0 \wedge v_1 = u_1 - 1 \wedge u_2 = u_1 :: v_2) \vee (u_1 = v_1 = 0 \wedge u_2 = v_2))((n, s), (0, s'))$$

$$\varphi_f := \exists i, w.\; x_2 = i :: w \wedge [(i * 2 \neq 0 \wedge y_1 = i * 2 + x_1 \wedge w = x_1 :: y_2) \vee (i = y_1 = 0 \wedge RES(x_1, w, x_1 :: y_2))]$$

$$\varphi_g := \exists i, w.\; x_2 = i :: w \wedge [(i \neq 0 \wedge y_1 = i + x_1 \wedge w = (2 * x_1) :: y_2) \vee (i = y_1 = 0 \wedge RES(2 * x_1, w, (2 * x_1) :: y_2))]$$

$$SPEC:\;\; (TC^{op}_{(x_1,x_2),(y_1,y_2)}\,\varphi_f)((2 * m, s), (\bot, \bot)) \Leftrightarrow (TC^{op}_{(x_1,x_2),(y_1,y_2)}\,\varphi_g)((m, s), (\bot, \bot))$$

Fig. 4: The recursive programs and their formalization in TcC


6 Some verification approaches can discover inductive invariants automatically [43,45],
or direct their construction based on the property being verified [63,50], but they do
not currently support coinductive reasoning.

To demonstrate the capabilities and benefits of the TcC framework for verification
and automated proof search, we present the following example, posed in
[47, Sec. 3]. The example consists of proving that the two recursive programs given
in Figure 4 (weakly) simulate one another. Both programs continually read the
next input, compute the double of the sum of all inputs seen so far, and output
the current sum. On input zero, both programs count down to zero and start
over. The goal is to formally verify that g(m) is equivalent to f(2m). However, as
noted in [47], a formal proof of this claim via the standard Tarskian coinduction 
principle is extremely laborious. This is mainly because one must come up with 
an appropriate “simulation relation” that contains all the intermediate execution 
steps of f and g, appropriately matched, which must be fully defined before we 
can even start the proof. 
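Before attempting a proof, the claimed equivalence can be sanity-checked by running both programs on the same finite input prefix. The sketch below is a plain simulation, not part of the TcC formalization. It follows the prose description of the programs: `f` doubles each input before adding it, `g` adds the raw input and outputs twice its argument, and on input zero both count down via `rest` and restart from 0. The driver function `run` and its string-based program selector are hypothetical.

```python
def run(program, start, inputs, max_outputs):
    """Run 'f' or 'g' from argument `start` on the given finite input list
    and return the output events produced (at most `max_outputs`)."""
    out = []
    feed = iter(inputs)
    mode, n = "main", start              # mode is "main" or "rest"
    while len(out) < max_outputs:
        if mode == "rest":
            if n > 0:
                out.append(n)            # output n; rest p (n - 1)
                n -= 1
            else:
                mode = "main"            # rest p 0 continues as p 0
            continue
        # Main loop body: output first, then read the next input.
        out.append(n if program == "f" else 2 * n)
        v = next(feed, None)
        if v is None:
            break                        # input prefix exhausted
        if program == "f":
            v *= 2
            if v != 0:
                n = v + n                # f (v + n)
            else:
                mode = "rest"            # rest f (v + n), and v = 0
        else:
            if v == 0:
                mode, n = "rest", 2 * n  # rest g (2 * m)
            else:
                n = v + n                # g (v + m)
    return out


inputs = [3, 0, 1, 2, 0, 0, 5]
print(run("g", 4, inputs, 60) == run("f", 8, inputs, 60))
```

Agreement on finite prefixes is only evidence, of course; the point of the coinductive proof is to cover all infinite streams of events without enumerating a simulation relation by hand.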

The (co)closure operators offer a formalization of the problem which is very
natural and amenable to automation, formalizing the programs by encoding all
(infinite) traces of f and g as streams of input/output events. Hence, the simulation
amounts to the fact that each such stream for f can be simulated by g, and vice
versa. The bottom part of Figure 4 shows the formalization of the specification
in TcC logic, where the encoding of each program is a natural simplification that
can easily (and automatically) be obtained from either structural operational
semantics or Floyd-Hoare-style axiomatic semantics. We use $\bot$ as a designated
unreachable element (i.e., an element not related to any other element). The fact
that the (co)closure operators can be applied to complex formulas that include,
for example, quantifiers, disjunctions and nesting of the (co)closure operators,
enables a concise, natural presentation without resorting to complex case analysis.
This offers a significant a priori simplification of the formula we provide to the
proof system (and, in turn, to a prover), even before starting the proof-search
procedure.


7 Notable exceptions include [66,76,20,21,22], which take a similar approach but invoke
second-order elements.

Fig. 5: Structure of the proof of one direction of SPEC


The cyclic proof system, in turn, enables a natural treatment of the coinductive
reasoning involved in the proof, in a way that is particularly amenable to
automation. Figure 5 outlines the structure of the proof of one direction of the
equivalence defined in SPEC. For conciseness, the subscripts $(x_1, x_2), (y_1, y_2)$ are
omitted from all $TC^{op}$ formulas and we use $(TC^{op}\,\varphi)_{\bot}((u, v))$ as a shorthand
for $(TC^{op}\,\varphi)((u, v), (\bot, \bot))$. The proof is compact and the local reasoning is
standard: namely, the unfolding of the $TC^{op}$ operator. The proof begins with
a single unfolding of the $TC^{op}$ formula on the left and then proceeds with its
unfolding on the right. The key observation is that the instantiation of the
unfolding on the right (i.e., the choice of the term $r$ in Rule $(TC^{op}_R)$) can be
automatically inferred from the terms of the left unfolding, by unification. Thus,
when applying Rule $(TC^{op}_R)$, one does not have to guess the intermediate term
(in this case, $(z_1/2, z_2)$); instead, the term can be automatically inferred from
the equalities in the subproof of the single-step implication, as illustrated by the
green question marks in Figure 5.


Finally, to formally establish the correctness of our simplified formalization,
one needs to prove that, for example, the abstract $RES(n, s, s')$ is indeed equivalent
to the concrete program restart on f and on g. This can be formalized and proved
in a straightforward manner, as the proof has a dual structure and contains a TC
cycle. This further demonstrates the compositionality of the TcC framework, as such
an inductive subproof is completely independent of the general, outer coinductive
$TC^{op}$ cycle.
5 Perspectives and Open Questions


As mentioned, the approach of non-well-founded proof theory holds great potential 
for improving the state-of-the-art in formal support for automated inductive 
and coinductive reasoning. But the investigation of cyclic proof systems is far 
from complete, and much work is still required to provide a full picture. This 
section concludes by describing two key research questions, one concerning 
the applicability of the framework and the other concerning the fundamental 
theoretical study of the framework. 


5.1 Implementing Non-well-founded Machinery 


Current theorem provers offer little or no support for implicit reasoning. Thus, 
major verification efforts are missing its great potential for lighter, more legible 
and more automated proofs. The main implementation of cyclic reasoning can be 
found in the cyclic theorem prover Cyclist [18], which is a fully automated prover
for inductive reasoning based on the cyclic framework developed in [5609]. 
Cyclist has been very successful in formal verification in the setting of separation 
logic. Cyclic inductive reasoning has also been partially implemented into the 
Coq proof assistant through the development of external libraries and
functional schemas [77]. Neither implementation supports coinductive reasoning,
however.

To guarantee soundness, and decide whether a cyclic pre-proof satisfies the 
global trace condition, most cyclic proof systems feature a mechanism that 
uses a construction involving an inclusion between Büchi automata (see, for
example, [[5[74]). This mechanism can be (and has been) applied successfully 
in automated frameworks, but it lacks the transparency and flexibility that one 
needs in interactive theorem proving. For example, encoding proof validity into 
Büchi automata makes it difficult to understand why a cyclic proof is invalid
in order to attempt to fix it. Therefore, to fully integrate cyclic reasoning into 
modern interactive theorem provers in a useful manner, an intrinsic criterion for 
soundness must be developed, which does not require the use of automata but 
instead operates directly on the proof tree. 


5.2 Relative Power of Explicit and Implicit Reasoning 


In general, explicit schemes for induction and coinduction are subsumed by their 
implicit counterparts. The converse, however, does not hold in general. In [19], 
it was conjectured that the explicit and cyclic systems for FOL with inductive 
definitions are equivalent. Later, they were indeed shown to be equivalent when 
containing arithmetics [I9], where the embedding of the cyclic system in the 
explicit one relied on an encoding of the cycles in the proof. However, it was 
also shown, via a concrete counter-example, that in the general case the cyclic 
system is strictly stronger than the explicit one [9]. But a careful examination of 
this counter-example reveals that it only refutes a weak form of the conjecture, 
according to which the inductive definitions available in both systems are the 
same. That is, if the explicit system is extended with other inductive predicates,
the counter-example for the equivalence no longer holds. Therefore, the less strict 
formulation of the question—namely, whether for any proof in the cyclic system 
there is a proof in the explicit system for some set of inductive predicates—has 
not yet been resolved. In particular, in the TcC setting, while the equivalence 
under arithmetics also holds, the fact that there is no a priori restriction on the 
(co)inductive predicates one is allowed to use makes the construction of a similar 
counter-example in the general case much more difficult. In fact, the explicit and 
cyclic systems may even coincide for TcC logic. 

Even in cases where explicit (co)induction can capture implicit (co)induction 
(or a fragment of it), there are still open questions regarding the manner in which 
this capturing preserves certain patterns. A key question is whether the capturing 
can be done while preserving important properties such as proof modularity. 
Current discourse contains only partial answers to such questions [75,77,68], which
should be investigated thoroughly and systematically. The uniformity provided 
by the closure operators in the TcC setting can facilitate a study of this subtle 
relationship between implicit and explicit (co)inductive reasoning. 


Acknowledgements. As mentioned in the introduction, the TcC framework is 
based on a wonderful ongoing collaboration with Reuben Rowe. The author is also 
extremely grateful to Andrei Popescu and Shachar Itzhaky for their contributions 
to the framework. 
Abstract. In recent years, deep learning has found successful applications
in mathematical reasoning. Today, we can predict fine-grained
proof steps, relevant premises, and even useful conjectures using neu- 
ral networks. This extended abstract summarizes recent developments 
of machine learning in mathematical reasoning and the vision of the 
N2Formal group at Google Research to create an automatic mathemati- 
cian. The second part discusses the key challenges on the road ahead. 
1 Introduction


The combination of machine learning and mathematical reasoning goes back at 
least to the 2000s when Stephan Schulz pioneered ideas to use machine learning 
to control the search process [44], and Josef Urban used machine learning to 
select relevant axioms [46,47]. With the advent of deep learning, interest in the 
area surged, as deep learning promises to enable the automatic discovery of new 
knowledge from data, while requiring minimal engineering. This suddenly offered 
a flurry of new possibilities also for theorem proving. 

One of the most challenging and impactful tasks in automated theorem prov- 
ing is premise selection, that is to find relevant premises from a large body of 
available theorems/axioms. Many classical reasoning systems do not scale well 
into thousands of potentially relevant facts, but some pioneering results by Urban 
et al. [47] proposed fast machine learning techniques using manually engineered 
features. However, with the inroads of deep learning, it has become clear that 
large quality improvements are possible by utilizing deep learning techniques. 
DeepMath [24] demonstrated that premise selection could be tackled with deep 
learning, directly (i.e., without feature engineering) applying neural networks to 
the text of the premise and that of the (negated) conjecture. 

In DeepMath, both premise and conjecture are embedded into a vector space 
by a (potentially expensive) neural network and then a second (preferably cheap) 
neural network compares the embedding of the current state to each available 
premise to judge whether the premise is useful. Loos et al. [36] demonstrated
for the first time that the same approach as DeepMath yields substantial
improvements as an internal guidance method within a first-order automated 
theorem prover. 
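The two-stage structure described above (embed once, then compare cheaply against every premise) can be sketched without any neural network. In the toy version below, a hashed bag of tokens stands in for the expensive embedding network and a dot product stands in for the cheap comparison network; both are deliberate simplifications, not the DeepMath models, and the premise names and texts are made up for illustration.

```python
import math
import zlib

DIM = 64  # embedding size, an arbitrary choice for this sketch


def embed(text):
    """Toy stand-in for the embedding network: a hashed, L2-normalized
    bag of whitespace-separated tokens."""
    vec = [0.0] * DIM
    for token in text.split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def score(conjecture_vec, premise_vec):
    """Toy stand-in for the comparison network: a dot product."""
    return sum(a * b for a, b in zip(conjecture_vec, premise_vec))


def rank_premises(conjecture, premises, top_k=2):
    conjecture_vec = embed(conjecture)
    # Premise embeddings are computed once and could be cached across goals.
    premise_vecs = {name: embed(text) for name, text in premises.items()}
    ranked = sorted(premises,
                    key=lambda name: score(conjecture_vec, premise_vecs[name]),
                    reverse=True)
    return ranked[:top_k]


premises = {
    "ADD_SYM": "x + y = y + x",
    "CLOSURE_CLOSED": "closed s ==> closure s = s",
}
print(rank_premises("x + y = y + x", premises, top_k=1))
```

The design point carried over from the text is the cost split: the expensive function runs once per formula, while the function evaluated for every (goal, premise) pair stays cheap.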


© The Author(s) 2021 
A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 25-37, 2021. 
Neural Theorem Provers. Emboldened by these early works and by breakthroughs
in deep learning, several groups extended interactive theorem provers¹
for the use in deep learning research, including Gamepad [23], HOList [5], Coq- 
Gym [54], GPT-f [39], and recently TacticZero [51]. A typical tactic application 
predicted by these systems looks as follows (here in HOL Light syntax): 


    REWRITE_TAC  [ PREMISE1 ; PREMISE2 ]
    \_________/  \_____________________/
    tactic name     list of premises


This specific tactic expects the given premises to be equalities, with which it 
attempts to rewrite subexpressions in the current proof goal. The hard part 
about predicting good tactics is to select the right list of premises from all 
the previously proven theorems. Some tactics also include free-form expressions, 
which can be a challenge as well. 
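The shape of such a prediction can be captured by a small data structure: a tactic name plus a list of premise names that is rendered back into the tactic string. This is a minimal sketch; the class name `TacticApplication`, the helper `top_premises`, and the example scores are hypothetical and do not reproduce the interface of any of the systems cited above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TacticApplication:
    tactic: str
    premises: List[str] = field(default_factory=list)

    def render(self) -> str:
        # HOL Light separates list elements with semicolons.
        return f"{self.tactic} [{'; '.join(self.premises)}]"


def top_premises(scores: Dict[str, float], k: int) -> List[str]:
    """Pick the k highest-scoring theorem names (scores are made-up inputs)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


scores = {"PREMISE1": 0.9, "PREMISE2": 0.7, "UNRELATED_THM": 0.1}
step = TacticApplication("REWRITE_TAC", top_premises(scores, 2))
print(step.render())  # REWRITE_TAC [PREMISE1; PREMISE2]
```

Separating the tactic name from the premise list reflects the difficulty noted in the text: the name is drawn from a small fixed set, whereas the premises are selected from all previously proven theorems.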

In contrast to lightweight machine learning approaches
(e.g. [13,25,26,38,31]), neural theorem provers aim to replicate the human ap- 
proach to proving theorems in ITPs, searching only through a relatively small 
number (e.g., hundreds) of proof steps that are very promising. To get high- 
quality proof steps, increasingly large neural networks (currently up to 774M 
parameters [39]) are trained on human proofs, or with reinforcement learning. 

Already, neural theorem provers can prove a significant portion (up to 70% 
[4]) of test theorems and some have found proofs that are shorter and more 
elegant than the proofs that human mathematicians have formalized in these 
systems. For example, for the theorem CLOSURE_CONVEX_INTER_AFFINE, proven 
with over 40 tactic calls in HOL Light [20], HOList/DeepHOL has found a proof
with just two tactic calls: 


    let CLOSURE_CONVEX_INTER_AFFINE = prove
     (`!s t:real^N->bool.
            convex s /\ affine t /\ ~(relative_interior s INTER t = {})
            ==> closure(s INTER t) = closure(s) INTER t`,
      SIMP_TAC [INTER_COMM; AFFINE_IMP_CONVEX;
                CLOSURE_INTER_CONVEX; RELATIVE_INTERIOR_AFFINE] THEN
      ASM_MESON_TAC [RELATIVE_INTERIOR_EQ_CLOSURE; INTER_COMM;
                     RELATIVE_INTERIOR_UNIV; IS_AFFINE_HULL]);;


1 The focus has been on interactive theorem provers as they are general enough to 
capture most of mathematics in theory, and several large-scale formalization efforts 
of the last decades have demonstrated that involved theories can be formalized in 
practice [28,14,19]. Also ITPs offer relatively short proofs compared to other auto- 
mated reasoning tools, which allows us to use stronger neural networks for the same 
computational budget. 
Similarly, Polu et al. reported several cases where they found proofs with their
neural theorem prover GPT-f that were shorter and more elegant than
those found by humans [39].


Neural Solvers. Closely related to neural theorem provers are methods that, 
instead of predicting proof steps, directly predict the solution to mathematical 
problems. A first impressive example was proposed by Selsam et al., who showed 
that graph neural networks can predict satisfying assignments of small Boolean 
formulas [45]. Lample and Charton have demonstrated that also higher-level rep- 
resentations, such as the integral of a formula, can be predicted directly using 
a Transformer [29]. They exploited the fact that for some mathematical opera- 
tions, such as taking the integral, the inverse operation (taking the derivative) is 
much easier. Hence, they can train on predicting generated formulas from their 
derivative without needing a tool that can generate the integral in the first place. 
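The data-generation trick can be illustrated on polynomials, where differentiation is a one-line operation. The sketch below is a simplified stand-in for the expression generator used in the cited work: it samples a random polynomial, differentiates it, and stores the pair in the "derivative in, antiderivative out" direction. Fixing the constant term to zero is an assumption of this sketch, made so that the target is uniquely determined by the input.

```python
import random


def derivative(coeffs):
    """Differentiate a polynomial; coeffs[i] is the coefficient of x**i."""
    return [i * c for i, c in enumerate(coeffs)][1:]


def make_example(rng, max_degree=4):
    """Sample F with zero constant term and return the training pair (F', F)."""
    degree = rng.randint(1, max_degree)
    F = [0] + [rng.randint(-5, 5) for _ in range(degree)]
    return {"input": derivative(F), "target": F}


rng = random.Random(0)
dataset = [make_example(rng) for _ in range(3)]
for example in dataset:
    # Every target is an antiderivative of its input by construction.
    print(example["input"], "->", example["target"])
```

No integration routine appears anywhere in the generator, which is exactly the property the approach exploits.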
Recently, Hahn et al. demonstrated that also classical verification problems, such 
as LTL satisfiability, can be solved directly with Transformers, beating existing 
tuned algorithms on their own dataset in some cases [18]. 


2 Towards the Automatic Mathematician 


We are convinced that the success of neural theorem provers and neural solvers is 
only the beginning of a larger development in which deep learning will revolution- 
ize automated reasoning, and have set out to build an automatic mathematician. 
Ideally, we could simply talk to an automatic mathematician like a colleague, 
and it would be able to contribute to mathematical research, for example by 
publishing papers without human support. 

An automatic mathematician would thus go far beyond theorem proving, as 
it would have to formulate and explore its own theories and conjectures, and be 
able to communicate in natural language. Yet, we believe that neural theorem 
provers are an important instrument of our plan, as they allow us to evaluate 
(generated) conjectures, which grounds the learning process in mathematically 
correct reasoning steps. And because neural theorem provers build on existing 
interactive theorem provers, they already come with a nucleus of formalized 
mathematics that we believe might be necessary to bootstrap the understanding 
of mathematics. In the following, we review some of the main challenges on the 
path towards an automatic mathematician and first approaches to address them. 


2.1 Neural Network Architectures 


Naturally, we need neural network architectures that can “understand” formulas, 
that is, make useful predictions based on formulas. The main question for the 
design of neural networks appears to be whether and, if yes, how to exploit the 
tree structure of formulas. 
Exploiting the Structure of Formulas. It is tempting to believe that the
embeddings of formulas should represent their semantics. Hence, many authors have
suggested to process formulas with tree-structured recurrent neural networks 
(TreeRNNs), which compute embeddings of expressions from the embeddings 
of their subexpressions, as this resembles the bottom-up way we define their 
semantics (e.g., [11,1,23,54]). That intuition, however, may be misleading. In 
our experiments, bottom-up TreeRNNs have performed significantly worse than 
top-down architectures (followed by a max-pool aggregation) [37]. This suggests 
that, to make good predictions based on formulas, it is important to consider 
subformulas in their context, which bottom-up TreeRNNs cannot do easily. 


Sequence Models. The alternative to representing the formula structure in the 
neural architecture is to interpret formulas simply as sequences of characters 
or symbols and apply sequence models. Early works using sequence model- 
ing relied on convolutional networks (simple convolutional networks [24] and 
wave-nets [36,5]), which compared favorably to gated recurrent architectures 
like LSTM/GRU. With the recent rise of the Transformer architecture [48],
sequence models have caught up to those that exploit the formula structure and
yielded excellent performance in various settings [29,41,52,39,18].

Sequence models come with two major advantages: First, it is straight- 
forward to not only read formulas, but also generate formulas, which is sur- 
prisingly challenging with TreeRNNs or graph neural networks. This allows us 
to directly predict proof steps as strings [39,52], and to tackle a wider range of 
mathematical reasoning tasks, such as predicting the integral of a formula [29], 
satisfying traces for formulas in linear temporal logics [18], or even more creative 
tasks, such as missing assumptions and conjectures [41].² Second, transformer
models have shown a surprising flexibility and promise a uniform way to process 
not only formulas, but also natural language, and even images [10]. This could 
prove crucial for processing natural language mathematics, which frequently con- 
tains formulas, text, and diagrams, and any model processing papers would need 
to understand how they relate to each other. Transformers certainly set a high 
bar for the flexibility, generality, and performance of future neural architectures. 


Large Models. Scaling up language models to larger and larger numbers of pa- 
rameters has steadily improved their results [27,22]. Also when we use language 
models for mathematics, we have observed that larger models tend to improve 
the quality of predictions [39,41]. GPT-3 has shown that certain abilities, such 
as basic arithmetic, appear to only materialize in models with at least a certain 
number of parameters [6]. If this turns out to be true for other abilities, this raises 
the question of how large models have to be to exhibit human-level mathematical 
reasoning abilities. 


² Yet, there are still cases where hard-coding some formula structure in transformer 
architectures can improve the results, as shown, for example, by Wu et al. [21,35,18], 
which suggests that transformers are not the end of the story regarding formula 
understanding. 

There is also the question of how exactly to scale up models. The mere 
number of parameters may not be as important as how we use them. More 
efficient alternatives to simply scaling up the transformer architecture might help 
with the problem of making large models accessible to more researchers (e.g., [32]). 


2.2 Training Methodology 


Neural networks have shown the ability to learn even advanced reasoning tasks 
via supervised learning, given the right training data. However, for many inter- 
esting tasks, we do not have such data and hence the question is how to train 
neural networks for tasks for which we have only limited data or no data at all. 


Reinforcement Learning. Reinforcement learning can be seen as a way to re- 
duce the amount of human-written proof data needed to learn a strong theorem 
prover. By training on the proofs generated by the system itself, we can improve 
its abilities to some extent, and the perhaps strongest neural theorem provers 
often use some form of reinforcement learning (e.g., up to 70% of the proofs in 
HOL Light [4]). But, for an open-ended training methodology, we need a sys- 
tem that can effectively explore new and interesting theories, without getting 
lost in irrelevant branches of mathematics. Partial progress has been made in 
training systems without access to human-written proofs [4,51], and to generate 
conjectures to train on in a reinforcement learning setting [12], but the problem 
is wide-open. 


Pretraining. In natural language understanding it is already common practice to 
pretrain transformers on a large body of text before fine-tuning them on the final 
task, especially when only limited data is available for that task. Even though the 
pretraining data is only loosely related to the final tasks, transformers benefit a 
lot from pretraining, as it contains general world knowledge and useful inductive 
biases [9]. Polu et al. have shown that the same can be observed when pretraining 
transformers on natural language texts from arXiv [39]. 


Self-supervised Training. The GPT models for natural language have shown that 
self-supervised language modeling (i.e., only “pre” training without training on 
any particular task) alone can equip transformers with surprising abilities [42,6]. 
Mathematical reasoning abilities, including type inference, predicting missing 
assumptions and conjecturing, can be learned in a very similar way by training 
transformers to predict missing subexpressions (skip-tree training) [41]. 
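
To make the idea of predicting missing subexpressions concrete, the following is a minimal sketch of how one training pair could be produced from a formula tree. It is an illustration under assumptions, not the data pipeline of [41]: the nested-tuple encoding, the `<PREDICT>` placeholder, and all function names are hypothetical choices made here.

```python
import random

MASK = "<PREDICT>"  # hypothetical placeholder token for the removed subexpression


def to_sexpr(tree):
    """Render a nested tuple (op, arg1, ...) or a leaf string as an S-expression."""
    if isinstance(tree, tuple):
        return "(" + " ".join(to_sexpr(t) for t in tree) + ")"
    return tree


def argument_paths(tree, path=()):
    """Yield the path (tuple of child indices) of every proper subexpression."""
    if isinstance(tree, tuple):
        for i in range(1, len(tree)):  # index 0 holds the operator symbol
            yield path + (i,)
            yield from argument_paths(tree[i], path + (i,))


def get_at(tree, path):
    """Return the subexpression found by following the child indices in path."""
    for i in path:
        tree = tree[i]
    return tree


def replace_at(tree, path, new):
    """Return a copy of tree with the subexpression at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]


def skip_tree_example(tree, rng):
    """Pick a random subexpression; return (masked input, target output)."""
    path = rng.choice(list(argument_paths(tree)))
    return to_sexpr(replace_at(tree, path, MASK)), to_sexpr(get_at(tree, path))
```

For the statement `(= (+ x 0) x)`, one possible pair is the input `(= <PREDICT> x)` with target `(+ x 0)`; a sequence model would be trained to produce the target from the input.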

Lample et al. devised several clever approaches to train transformers when 
data is not directly available. In unsupervised translation training, transformers 
successfully learn to translate between different natural languages starting only 
with monolingual corpora and without any corresponding pairs of sentences [30]. 
This approach was even generalized to learn to translate between programming 
languages without corresponding pairs of programs in different languages [43]. 
The application of these unsupervised translation ideas to mathematics is tempting, 
but we found that their straightforward application does not lead to 
good results. Wang et al. [49] also report mixed results. 


Learning to Retrieve Relevant Information. If we apply standard language models 
to mathematics, e.g., to predict the next proof step, we expect them to store 
all the information necessary to make good predictions in their parameters. As 
the large transformer models have shown (see, e.g., GPT [42,6]), this approach 
actually works pretty well for natural language question answering, and also for 
mathematical benchmarks it has been surprisingly successful [41,39,53]. How- 
ever, there may be a limit to this approach in cases where we expect detailed, 
consistent, and up-to-date predictions. Guu et al. [17] introduced a hybrid of 
transformer and retrieval model, REALM, which learns to retrieve Wikipedia 
articles that are relevant to a given question and extract useful information 
from the article. REALM is trained self-supervised to retrieve multiple articles 
and try to use each of them individually to make predictions. The article that 
led to the best prediction is deemed to be the most relevant, and is used to 
train the retrieval query for future training iterations. This approach has been 
extended in follow-up work [33,2,34,3] and appears to be a promising approach 
also to retrieve the relevant context, such as definitions, possible premises, and 
even related proofs, for mathematical reasoning. 


2.3 Instant Utilization of New Premises 


Theorem proving has a key difference compared to other reinforcement learning 
settings: whenever we reach one of the goals, i.e., prove a theorem, we can use 
that goal as a premise for future proof attempts. Any learning method applied 
in a reinforcement learning setting for theorem proving thus needs the ability to 
adapt to this growing action space, and ideally does not need to be retrained at 
all when a new theorem becomes available to be used. 


Premise selection approaches that are built on retrieval, such as DeepMath 
[24,36] and HOList [5,37], offer this ability: When a new theorem is proven, we 
can add it to the list of premises that can be retrieved and future retrieval queries 
can return the statement. This appears to work well, even when the provers are 
applied to a new set of theorems, as demonstrated by the DeepHOL prover when 
it was applied to the unseen Flyspeck theorem database [5]. We can even exploit 
this kind of generalization for exploration and bootstrap neural theorem provers 
without access to human proofs as training data [4]. 
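
The property described above, that a newly proven theorem becomes retrievable without retraining, can be sketched as follows. This is only an illustration: the bag-of-tokens `embed` function is a stand-in for a learned encoder, and the class and premise names are hypothetical, not the DeepMath or HOList implementation.

```python
import math
from collections import Counter


def embed(statement):
    """Stand-in for a learned encoder (assumption): a bag-of-tokens count vector."""
    return Counter(statement.split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class PremiseIndex:
    """Growing list of premises; adding one requires no retraining of the encoder."""

    def __init__(self):
        self.premises = []  # (name, embedding) pairs

    def add(self, name, statement):
        self.premises.append((name, embed(statement)))

    def retrieve(self, goal, k=2):
        """Return the names of the k premises most similar to the goal."""
        query = embed(goal)
        ranked = sorted(self.premises, key=lambda p: cosine(query, p[1]), reverse=True)
        return [name for name, _ in ranked[:k]]
```

Once a theorem is proven, a single call to `add` makes it available to every later `retrieve` query, which is the behaviour that tactic-string prediction with a plain language model does not provide out of the box.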


A new challenge arises from the use of language models for theorem proving. 
Theorem provers using transformers currently have no dedicated retrieval mod- 
ule, and instead predict the statements or names of premises as part of the tactic 
string (cf. [39]). In our experience this does not provide the required generaliza- 
tion to unseen premises without retraining. (Though there are experiments that 
suggest that it might be possible [8].) Future approaches will have to find a way 
to combine the strong reasoning skills and generative abilities of Transformer 
models with the ability to use new premises without retraining. 


2.4 Natural Language 


We believe that, perhaps counterintuitively, natural language plays a central 
role in automated reasoning. The most direct reason is that only a small part of 
mathematics has been formalized so far, and a pragmatic approach to tap into 
much more training data is to find a way to learn from natural language mathe- 
matics (books and papers on mathematical topics). In this section, however, we 
want to look beyond the question of feasibility and training data, and discuss 
the broad advantages of a natural language approach to mathematics. 


Accessibility. A bridge between natural and formal mathematics could help to 
make the system much more accessible, by not requiring the users to learn a 
specific formal language. This might open up mathematics to a much wider au- 
dience, enabling advanced mathematical assistants (think WolframAlpha [50]), 
and tools for education. 

Vice versa, an advanced automatic mathematician without the ability to 
explain its reasoning in natural language might be hard to understand. Even 
if the system’s predictions and theories are correct, sophisticated, and relevant, 
we might not be able to use them to inform our own understanding if the notions 
the system comes up with are only available as vast synthetic formal objects. 


Conjecturing, Theory Exploration, and Interestingness. Various approaches have 
been suggested to produce new conjectures, including heuristic filters [40], de- 
riving rules from data [7], and learning and sampling from a distribution of 
theorems using language modeling [41]. 

A particularly interesting idea is the use of adversarial training to generate 
conjectures (e.g., [12]). Here, two neural networks compete against each other— 
one with the aim to prove statements and the other with the aim to suggest 
hard-to-prove statements, somewhat akin to generative adversarial nets [15]. The 
idea is that the competition between the two networks generates a curriculum of 
harder and harder problems to solve and also automatically explores new parts 
of mathematics (as old parts get easier over time). However, there seems to be 
a catch: Once the network that suggests problems has figured out how to define 
a one-way function, it becomes very easy to produce an unlimited number of 
hard problems, such as to find an input to the SHA256 function that produces 
a certain output hash. This class of problems is almost impossible to solve, and 
thus likely leads the process into a dead-end. 

Once again, natural language seems to be a possible answer. Using the large 
body of natural language mathematics could help to equip machine learning 
models with a notion of what human mathematicians find interesting, and focus 
on these areas. 


Grounding Language Models. Autoformalization does not only produce formal 
objects as a desired outcome, it also serves the dual purpose to improve language 
models. Checking the models’ outputs and feeding back their correctness as a 
training signal would provide valuable grounding for their understanding. 

Of course, the gap between formalized and informal mathematics is huge: 
it will likely require a considerable level of effort to automatically create high 
quality formalizations. Also, we believe that we will likely need a very high qual- 
ity theorem prover to bootstrap any autoformalization system. However, recent 
progress in neural language processing [9,42], unsupervised translation [30,43] 
and also neural network based symbolic mathematics [29,41,18,39] makes this 
path seem increasingly feasible and appealing in the long run. 


3 Conclusion 


In this extended abstract, we surveyed recent results in neural theorem proving 
and our mission to build an artificial mathematician, as well as some of the 
challenges on this path. While there is no guarantee that we can overcome these 
challenges, and there might be challenges that we cannot even anticipate yet, 
even partial success in our mission could provide the formal methods community 
with tools to simplify the formalization process, and impact adjacent areas, such 
as verification, program synthesis, and natural language understanding. 

In a 2018 survey among AI researchers, the median prediction for when ma- 
chines “routinely and autonomously prove mathematical theorems that are pub- 
lishable in top mathematics journals today, including generating the theorems 
to prove” was in the 2060s [16]. However, over the last years, deep learning has 
already beaten a lot of expectations (at least ours) as to what is possible in au- 
tomated reasoning. There are still several challenges to be solved, some of which 
we laid out in this abstract, but we believe that creating a truly intelligent ar- 
tificial mathematician is within reach and will happen on a much shorter time 
frame than many experts expect. 


Abstract. Sentential Calculus with Identity (SCI) is an extension of 
classical propositional logic, featuring a new connective of identity be- 
tween formulas. In SCI two formulas are said to be identical if they share 
the same denotation. In the semantics of the logic, truth values are dis- 
tinguished from denotations, hence the identity connective is strictly 
stronger than classical equivalence. In this paper we present a sound, 
complete, and terminating algorithm deciding the satisfiability of SCI- 
formulas, based on labelled tableaux. To the best of our knowledge, it 
is the first implemented decision procedure for SCI which runs in NP, 
i.e., is complexity-optimal. The obtained complexity bound is a result of 
dividing derivation rules in the algorithm into two sets: decomposition 
and equality rules, whose interplay yields derivation trees with branches 
of polynomial length with respect to the size of the investigated formula. 
We describe an implementation of the procedure and compare its perfor- 
mance with implementations of other calculi for SCI (for which, however, 
the termination results were not established). We show possible refine- 
ments of our algorithm and discuss the possibility of extending it to other 
non-Fregean logics. 


Keywords: Sentential Calculus with Identity - non-Fregean logics - la- 
belled tableaux - decision procedure - termination - computational com- 
plexity. 


1 Introduction 


In this paper, we present a decision procedure for the non-Fregean sentential 
calculus with identity SCI. The contribution of the paper is twofold. First of 
all, this is the first implemented and complexity-optimal decision procedure for 
SCI, although several deduction systems for SCI have already been presented in 
the literature. Second, our decision procedure is constructed in the paradigm of 
labelled tableaux, which makes the whole approach more robust to modifications 
and extensions to other non-Fregean logics. 

Non-Fregean logic is an alternative to both classical and many non-classical 
systems whose semantics identifies semantical correlates of sentences with their 
logical values. According to the classical approach in model theory, semanti- 
cal structures (realities) correspond to the language that is meant to describe 
them, and therefore, symbols and expressions of that language, such as individual 
constants or relational symbols, have their denotations in these structures (re- 
spectively, objects or relations between objects). However, sentences are treated 
differently, as they are interpreted in models only in terms of logical values or 
other semantical relations such as satisfaction or truth. This classical approach 
allows us to answer the very basic logical question of whether the sentences are 
logically equivalent; however, it does not provide any tool that would allow us to 
check whether the sentences describe or refer to the same situation, or have the 
same meaning. Thus, the main motivation for non-Fregean logic was the need 
for an extensional and two-valued logic that could be used to represent seman- 
tical denotations of sentences that — depending on the underlying philosophical 
theory of language or the reality to which a logic is supposed to refer — could 
be understood as situations, states of affairs, meanings, etc. In order to express 
(non)identities or other interactions between the referents of sentences, at least 
the universe of denotations of sentences needs to be added to the semantics and 
the new identity connective to the language. 

The minimal two-valued non-Fregean propositional logic SCI (Sentential Cal- 
culus with Identity), introduced by Suszko (see [21]), is an extension of classical 
propositional logic with a new binary connective of identity ($\equiv$) and axioms 
reflecting its fundamental properties. The identity connective represents the 
identity of the denotations of sentences, and so, an expression ‘$\varphi \equiv \psi$’ should be read 
as ‘the sentences $\varphi$ and $\psi$ describe the same «thing»’. The semantics for SCI is 
based on structures determined by a universe of the denotations of sentences, a 
set of facts (those denotations that actually hold), and operations corresponding 
to all the connectives. The identity connective is then interpreted as an operation 
representing an equivalence relation that additionally satisfies the extensionality 
property. In the non-Fregean approach the identity and equivalence connectives 
are in general not equivalent: two sentences with the same truth value can have 
different denotations. Take, for instance, the following three statements: 


A ‘There is an effective method for determining whether an arbitrary formula 
of classical propositional logic is a theorem of that logic.’ 

B ‘Classical propositional logic is finitely axiomatizable, has a recursive set of 
recursive rules and enjoys the finite model property.’ 

C ‘Classical propositional logic is Post consistent.’ 


A, B, C are all (necessarily) true as theorems of mathematical logic. Therefore, 
they are pairwise logically equivalent, that is, all three equivalences: $A \leftrightarrow B$, 
$B \leftrightarrow C$, and $A \leftrightarrow C$ hold. One can fairly claim that A and B refer to the same 
fact, so $A \equiv B$, but C clearly has a different semantic correlate than both A and 
B, as decidability is independent of Post consistency. Thus, we have $A \not\equiv C$ and 
$B \not\equiv C$. 

It is known that the class of all non-equivalent non-Fregean propositional 
logics satisfying the laws of classical logic is uncountable [7], and some of these 
logics are equivalent to the well-known non-classical logics (e.g., modal logics S4 
and S5, many-valued logics). Higher-order non-Fregean logics are very expres- 
sive. In particular, a logic obtained from SCI by adding propositional quantifiers 
is undecidable and can express many mathematical theories, e.g., Peano arith- 
metic, the theory of groups, rings, and fields [8]. Furthermore, non-classical and 
deviant modifications of SCI have been developed and extensively studied in 
the literature, in particular intuitionistic logics [17,14,4], modal and epistemic 
logics [15,16], logics with non-classical identity [13], and paraconsistent logics [6,9]. The 
non-Fregean approach could turn out to be more adequate than the classical 
one in cognitive science or natural language processing. Moreover, non-Fregean 
logic could serve as a general framework for comparing different aspects of logics 
with incompatible languages and semantics and help in addressing the question 
of which class of logics handles logical symbols in the most adequate way from 
the perspective of natural language. 


In the original works by Suszko and Bloom the deduction system for SCI 
was defined in the Hilbert style [1,2]. Sound and complete deduction systems 
which are better suited for automated theorem proving were constructed later: 
Gentzen sequent calculi [18,22,23,3] and dual tableau systems [5,19,10]. A de- 
tailed presentation of all of them can be found in [10]. The main disadvantage of 
the aforementioned systems is that they are not decision procedures, while SCI 
is decidable and in particular in NP [2, Theorem 2.3]. Although the system by 
Wasilewska [22] can be seen as a meta-tool for deciding validity of SCI-formulas, 
it is equipped with external meta-machinery that is not a part of the system it- 
self. As a result, it constitutes another proof for decidability of SCI, rather than 
being a decision procedure in the classical sense of the term, that is suitable 
for computer implementations. In [11] a tableau-based algorithm for SCI was 
presented as a work-in-progress. The decision procedure presented in this paper 
is a result of a substantial remodelling of the preliminary system introduced 
in [11], for which we prove soundness and completeness, present surprisingly 
straightforward proofs of termination and membership in NP, and provide an 
implementation. 


In this paper, we present a new deduction system $T_{\mathrm{SCI}}$ for the logic SCI, based 
on labelled tableaux. To the best of our knowledge, it is the first decision 
procedure for SCI. Moreover, its upper complexity bound, that is NP, matches the 
complexity class of the satisfiability problem for SCI, thus making the algorithm 
complexity-optimal. $T_{\mathrm{SCI}}$ is built in the paradigm of labelled tableaux. The 
language of deduction is an extension of the SCI-language with two sorts of labels 
representing the denotations of formulas (i.e., «facts» and «non-facts») as well 
as with the equality and the inequality relation that can hold between labels. 
(In)equality formulas occurring in a derivation tree provide additional information 
on identity or distinctness of the denotations of formulas. In Section 2, we 
provide a formal overview of the logic SCI. In Section 3, we introduce the tableau 
algorithm $T_{\mathrm{SCI}}$ and prove its soundness, completeness, and termination, establish 
that it is complexity-optimal with respect to SCI-satisfiability, and show a 
possible refinement thereof. In Section 4, we discuss an implementation of $T_{\mathrm{SCI}}$ 
and compare it with an older prover based on a heuristic, unproven algorithm. 
Conclusions and directions of further research are presented in Section 5. 


2 SCI 


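Syntax. Let $\mathcal{L}_{\mathrm{SCI}}$ be a language of the logic SCI with the alphabet $(\mathrm{AF}, \neg, \to, \equiv)$, 
where $\mathrm{AF} = \{p, q, r, \ldots\}$ is a denumerable set of atomic formulas. The set FOR 
of SCI-formulas is defined by the following abstract grammar: 


$$\varphi ::= p \mid \neg\varphi \mid \varphi \to \varphi \mid \varphi \equiv \varphi,$$ 
where $p \in \mathrm{AF}$. 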
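
The grammar maps directly onto an algebraic data type. The sketch below is one possible encoding, given only for illustration; the class names `Atom`, `Not`, `Imp`, `Iden` and the ASCII rendering (`~`, `->`, `==`) are assumptions made here and are not taken from the implementation discussed in Section 4.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    arg: "Formula"


@dataclass(frozen=True)
class Imp:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Iden:  # the identity connective
    left: "Formula"
    right: "Formula"


Formula = Union[Atom, Not, Imp, Iden]


def show(f):
    """Render a formula in ASCII: ~ for negation, -> for implication, == for identity."""
    if isinstance(f, Atom):
        return f.name
    if isinstance(f, Not):
        return "~" + show(f.arg)
    op = "->" if isinstance(f, Imp) else "=="
    return f"({show(f.left)} {op} {show(f.right)})"


def subformulas(f):
    """Return the set of all subformulas of f, including f itself."""
    if isinstance(f, Atom):
        return {f}
    if isinstance(f, Not):
        return {f} | subformulas(f.arg)
    return {f} | subformulas(f.left) | subformulas(f.right)
```

Frozen dataclasses are hashable and compare structurally, so formulas can be stored in sets, which is convenient when counting subformulas of an input formula.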


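Axiomatization. The logic SCI is axiomatized by the following set of truth-
functional (1-3) and identity (4-8) axiom schemes: 


1. $\varphi \to (\psi \to \varphi)$ 

2. $(\varphi \to (\psi \to \chi)) \to ((\varphi \to \psi) \to (\varphi \to \chi))$ 

3. $(\neg\varphi \to \neg\psi) \to (\psi \to \varphi)$ 

4. $\varphi \equiv \varphi$ 

5. $(\varphi \equiv \psi) \to (\neg\varphi \equiv \neg\psi)$ 

6. $(\varphi \equiv \psi) \to ((\chi \equiv \theta) \to ((\varphi \to \chi) \equiv (\psi \to \theta)))$ 

7. $(\varphi \equiv \psi) \to ((\chi \equiv \theta) \to ((\varphi \equiv \chi) \equiv (\psi \equiv \theta)))$ 

8. $(\varphi \equiv \psi) \to (\varphi \to \psi)$ 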


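Semantics. Let $U \neq \emptyset$, $D \subseteq U$, and let $\tilde{\neg} : U \to U$, $\tilde{\to} : U \times U \to U$, 
and $\tilde{\equiv} : U \times U \to U$ be functions on U. An SCI-model is a structure $M = 
(U, D, \tilde{\neg}, \tilde{\to}, \tilde{\equiv})$, where U and D are called, respectively, universe and set of 
designated values, and the following conditions are satisfied for all $a, b \in U$: 


\begin{align}
\tilde{\neg} a \in D &\text{ iff } a \notin D \tag{1}\\
a \mathbin{\tilde{\to}} b \in D &\text{ iff } a \notin D \text{ or } b \in D \tag{2}\\
a \mathbin{\tilde{\equiv}} b \in D &\text{ iff } a = b. \tag{3}
\end{align}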


A valuation in an SCI-model $M = (U, D, \tilde{\neg}, \tilde{\to}, \tilde{\equiv})$ is a function $V : \mathrm{FOR} \to U$ 
such that for all $\varphi, \psi \in \mathrm{FOR}$ it holds that $V(\neg\varphi) = \tilde{\neg} V(\varphi)$ and $V(\varphi \mathbin{\#} \psi) = 
V(\varphi) \mathbin{\tilde{\#}} V(\psi)$, for $\# \in \{\to, \equiv\}$. An element $a \in U$ such that $a = V(\varphi)$ is called 
the denotation of $\varphi$. Interestingly, an SCI-model can be defined alternatively as a 
triple $M = (U, D, V)$, where a valuation $V : \mathrm{FOR} \to U$ needs to satisfy the 
conditions analogous to (1)-(3) (for instance, $V(\neg\varphi) \in D$ iff $V(\varphi) \notin D$ etc.). 
In the original approach V may as well be defined only for atomic formulas 
and then lifted up homomorphically to the set of all formulas, like in classical 
propositional logic. In the latter setting it is not the case, as a valuation defined 
solely for atoms does usually not have a unique extension to all formulas. We 
say that a formula $\varphi$ is satisfied in an SCI-model $M = (U, D, \tilde{\neg}, \tilde{\to}, \tilde{\equiv})$ and a 
valuation V in M, and refer to it as $M, V \models_{\mathrm{SCI}} \varphi$, if its denotation belongs to 
D. We call a formula $\varphi$ satisfiable if it is satisfied in some SCI-model by some 
valuation. We say that a formula $\varphi$ is true in a model $M = (U, D, \tilde{\neg}, \tilde{\to}, \tilde{\equiv})$, and 
refer to it as $M \models_{\mathrm{SCI}} \varphi$, whenever it is satisfied in M by all the valuations in M. 
We call a formula $\varphi$ valid, and refer to it as $\models_{\mathrm{SCI}} \varphi$, if it is true in all SCI-models. 
Note that over the class of models where D and $U \setminus D$ are singletons SCI collapses 
to classical propositional logic. In fact all formulas which are SCI-instances of 
formulas valid in classical propositional logic are also valid in SCI. It suffices, however, 
to take a three-element model to tell $\leftrightarrow$ and $\equiv$ apart, as shown in the following 
example. 


Example 1. Although the formula $\neg\neg p \leftrightarrow p$ is a tautology of classical proposi-
tional logic, the formula $\neg\neg p \equiv p$ is not valid in SCI. Indeed, consider an SCI-
model $M = (U, D, \tilde{\neg}, \tilde{\to}, \tilde{\equiv})$, where $U = \{0, 1, 2\}$, $D = \{1, 2\}$, and the operations 
$\tilde{\neg}$, $\tilde{\to}$, $\tilde{\equiv}$ are defined by: 


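$$\tilde{\neg} a = \begin{cases} 0, & \text{if } a \neq 0,\\ 1, & \text{otherwise,} \end{cases}$$ 

$$a \mathbin{\tilde{\to}} b = \begin{cases} 0, & \text{if } a \neq 0 \text{ and } b = 0,\\ 2, & \text{if } a = b,\\ 1, & \text{otherwise,} \end{cases}$$ 

$$a \mathbin{\tilde{\equiv}} b = \begin{cases} 0, & \text{if } a \neq b,\\ 2, & \text{if } a = b \text{ and } a \neq 0,\\ 1, & \text{otherwise.} \end{cases}$$ 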


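It is easy to verify that such a structure is an SCI-model. Then, the following 
hold: 


– $\tilde{\neg}\tilde{\neg} 2 = 1$, and so, M and a valuation V in M such that $V(p) = 2$ falsify the 
formula $\neg\neg p \equiv p$, 


– $1 \mathbin{\tilde{\to}} 2 = 1$, but $\tilde{\neg} 2 \mathbin{\tilde{\to}} \tilde{\neg} 1 = 2$, and so, the formula $(p \to q) \equiv (\neg q \to \neg p)$ is not 
true in M. 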
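
The claims of Example 1 can be checked mechanically. The sketch below is an illustration, not part of the paper's implementation. The three operation tables are one reading of the example's definitions, chosen so that conditions (1)-(3) and both displayed computations hold; the function names and the nested-tuple formula encoding are assumptions introduced here.

```python
from itertools import product

U = {0, 1, 2}
D = {1, 2}


def neg(a):
    return 0 if a != 0 else 1


def imp(a, b):
    if a != 0 and b == 0:
        return 0
    return 2 if a == b else 1


def iden(a, b):
    if a != b:
        return 0
    return 2 if a != 0 else 1


def is_sci_model():
    """Check conditions (1)-(3) of the semantics by brute force over U."""
    c1 = all((neg(a) in D) == (a not in D) for a in U)
    c2 = all((imp(a, b) in D) == (a not in D or b in D) for a, b in product(U, U))
    c3 = all((iden(a, b) in D) == (a == b) for a, b in product(U, U))
    return c1 and c2 and c3


def denote(f, val):
    """Denotation of a formula given as an atom name or a tuple (op, args...)."""
    if isinstance(f, str):
        return val[f]
    if f[0] == "not":
        return neg(denote(f[1], val))
    op = imp if f[0] == "imp" else iden
    return op(denote(f[1], val), denote(f[2], val))


def satisfied(f, val):
    """A formula is satisfied when its denotation is a designated value."""
    return denote(f, val) in D
```

With `V(p) = 2` the formula for double negation under identity is not satisfied, and with `V(p) = 1`, `V(q) = 2` neither is contraposition under identity, while the implication from double negation to `p` is satisfied under every valuation.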


What is also characteristic of SCI is that identical formulas can be interchanged 
within other formulas with not only truth preservation, but also identity 
preservation. For instance, if $p \equiv (p \to q)$, then $p \equiv ((p \to q) \to q)$, 
$p \equiv (((p \to q) \to q) \to q)$ and so on. On the other hand, identity of two formulas 
does not automatically yield identity of their subformulas. For example, if 
$\neg p \equiv \neg q$, it does not necessarily mean that $p \equiv q$. It is worth noting that in SCI 
we lack the usual equivalence between treating $\land$, $\lor$, and $\leftrightarrow$ as abbreviations 
involving $\neg$ and $\to$ and treating them as independent connectives whose mutual 
relations are established axiomatically. For instance, when $\neg(\varphi \to \neg\psi)$ is just a 
notational variant for $\varphi \land \psi$, then $(\varphi \land \psi) \equiv \neg(\varphi \to \neg\psi)$ is, of course, SCI-valid; 
however, it would not be the case if we regarded $\land$ as a separate connective. 
Nevertheless, extending our results to other connectives introduced as independent 
logical constants is a matter of routine. 
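
The first substitution claim can be verified semantically in two lines. In the sketch below, $a = V(p)$ and $b = V(q)$ are names introduced only for this illustration; by condition (3), satisfaction of $p \equiv (p \to q)$ means exactly that $V(p) = V(p \to q)$.

```latex
\begin{align*}
V(p) = V(p \to q) \;&\Longrightarrow\; a = a \mathbin{\tilde{\to}} b, \\
V((p \to q) \to q) \;&=\; (a \mathbin{\tilde{\to}} b) \mathbin{\tilde{\to}} b
  \;=\; a \mathbin{\tilde{\to}} b \;=\; a \;=\; V(p).
\end{align*}
```

Hence $p \equiv ((p \to q) \to q)$ is satisfied as well, and repeating the step gives the longer formulas. For the converse direction, condition (1) forces $\tilde{\neg} 1 = \tilde{\neg} 2 = 0$ in any model with $U = \{0, 1, 2\}$ and $D = \{1, 2\}$, such as the one in Example 1, so a valuation with $V(p) = 1$ and $V(q) = 2$ satisfies $\neg p \equiv \neg q$ but not $p \equiv q$.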


3 Tableaux 


In this section, we provide a characterization of a sound, complete and termi-
nating labelled tableau system for the logic SCI, which we call $T_{\mathrm{SCI}}$. 

Let $L^+$, $L^-$ be countably infinite disjoint sets and let $L = L^+ \cup L^-$. We will 
call an expression $w : \varphi$ a labelled formula, where $w \in L$ and $\varphi \in \mathrm{FOR}$, and 
w will be called a label. We will denote the set of all labelled formulas by 
LF. Any labels superscribed with ‘+’ are restricted to belong to $L^+$ and labels 
superscribed with ‘−’ to belong to $L^-$. Labels without a superscript are not 
restricted. Intuitively, w stands for the denotation of $\varphi$ in an intended model. 
Labels with ‘+’ in the superscript denote elements of D, whereas labels with 
superscribed ‘−’ represent elements of $U \setminus D$. Thus, expressions of the form $w = v$ 
or $w \neq v$ reflect, respectively, the equality or distinctness of two denotations. By 
$\mathrm{Id}^+$, $\mathrm{Id}^-$ we denote the sets of, respectively, all equalities and all inequalities of 
labels. Finally, we let $\mathrm{Id} = \mathrm{Id}^+ \cup \mathrm{Id}^-$. 

A tableau generated by the system for the logic SCI is a derivation tree whose 
nodes are assigned labelled formulas and (in)equality expressions. A simple path 
$\mathcal{B}$ from the root to a leaf in a tableau $\mathcal{T}$ is called a branch of $\mathcal{T}$. We will identify 
a branch $\mathcal{B}$ with the set of labelled formulas and (in)equalities occurring on $\mathcal{B}$. 

The rules of our tableau system have the following general form: 

$$\frac{\Phi}{\Psi_1 \mid \ldots \mid \Psi_n}$$ 

where $\Phi$ is the set of premises and each $\Psi_i$, for $i \in \{1, \ldots, n\}$, is a set of 
conclusions. Intuitively, the ‘$\mid$’ symbol should be read as a meta-disjunction. A 
rule with only one set of conclusions is called a non-branching rule. A rule with 
several sets of conclusions is a branching rule. In $T_{\mathrm{SCI}}$, all rules where the $\Psi_i$, for 
$i \in \{1, \ldots, n\}$, contain labelled formulas are called decomposition rules. All rules 
with a single equality statement as the conclusion are called equality rules. The 
remaining rules, in which $\bot$ occurs as the conclusion, are referred to as closure 
rules. If we have a decomposition rule (R) with $w : \varphi$ as its premise, then (R) is 
applicable to $w : \varphi$ occurring on a branch $\mathcal{B}$ if it has not been applied to $w : \varphi$ 
on $\mathcal{B}$ before. Otherwise $w : \varphi$ is called (R)-expanded on $\mathcal{B}$. For an equality rule 
(R) with $\Phi$ as the set of premises and $w = v$ as the conclusion, (R) is applicable 
to $\Phi \subseteq \mathcal{B}$ if $w = v$ is not present on $\mathcal{B}$. Otherwise $\Phi$ is (R)-expanded on $\mathcal{B}$. 
Intuitively, if a set of premises $\Phi$ is (R)-expanded on $\mathcal{B}$, then applying (R) to $\Phi$ 
would not add any new information to $\mathcal{B}$. 

A branch $\mathcal{B}$ of a tableau $\mathcal{T}$ is extended by applying rules of the system to 
sets of labelled formulas and (in)equality statements that are already on $\mathcal{B}$. A 
label w is present on $\mathcal{B}$ if there exists a formula $\varphi$ such that $w : \varphi$ occurs on $\mathcal{B}$. 
Otherwise w is fresh on $\mathcal{B}$. A branch $\mathcal{B}$ is called closed if one of the closure rules 
has been applied to it, that is, when an inconsistency occurs on $\mathcal{B}$. A branch 
that is not closed is open. A branch $\mathcal{B}$ is fully expanded if it is closed or no rules 
are applicable on it. A tableau $\mathcal{T}$ is called closed if all of its branches are closed. 
Otherwise $\mathcal{T}$ is called open. We call $\mathcal{T}$ fully expanded if all of its branches are 
fully expanded. 

Analytic tableaux are satisfiability checkers, so a tableau proof of a formula $\varphi$ 
is a closed tableau with a labelled formula $w^- : \varphi$ at its root. A formula is tableau-
valid if all tableaux with $w^- : \varphi$ at the root are closed. On the other hand, a 
formula $\varphi$ is tableau-satisfiable if there exists an open and fully expanded tableau 
with a labelled formula $w^+ : \varphi$ at its root. Note that our notion of tableau-
satisfiability matches the usual notion of satisfiability as a failure of finding a 
proof. Indeed, if a formula $\varphi$ is not tableau-valid, that is, there exists a tableau 
with $w^- : \varphi$ at the root which has an open branch, then $\neg\varphi$ is tableau-satisfiable. 
Thus, the standard duality between validity and satisfiability is reflected in the 
concepts of tableau-validity and tableau-satisfiability. 


1 Labels occurring in conclusions of the rules $(\neg^+)$, $(\neg^-)$, $(\to^+)$, $(\to^-)$, $(\equiv^+)$, $(\equiv^-)$ 
are fresh on the branch. 
2 The abbreviation $\varphi \approx \psi$ represents the set of three preconditions: $w : \varphi$, $v : \psi$, 
$w = v$, for some $w, v \in L$. Similarly for $\chi \approx \theta$. 


Fig. 1. Tableau system $T_{\mathrm{SCI}}$ 


3.1 Tableau System for SCI 


The rules presented in Figure 1 constitute the tableau system $T_{\mathrm{SCI}}$ for the logic 
SCI. The decomposition rules $(\neg^+)$, $(\neg^-)$, $(\to^+)$, $(\to^-)$, $(\equiv^+)$, $(\equiv^-)$ reflect the 
semantics of $\neg$, $\to$ and $\equiv$ defined in conditions (1)-(3) from Section 2. Note that 
an application of any of these rules introduces to a branch fresh labels for each of 
the subformulas into which the premise formula is decomposed. By that means, 
all occurrences of subformulas of the input formula $\varphi$ are assigned their unique 
labels. A few words of extra commentary on the rule $(\equiv^-)$ are in order. It 
decomposes a formula involving the $\equiv$ connective, which is assumed to be false. By 
the semantics of $\equiv$ we know that the constituents of the initial $\equiv$-formula have 
distinct denotations. If these denotations have different polarities, representing 
different truth values (disjuncts 2 and 3 in the denominator of the rule), then no 
additional information has to be stored about the distinctness of these denota-
tions. If, on the other hand, the denotations have the same polarity, representing 
the same truth value (disjuncts 1 and 4 in the denominator of the rule), then 
extra information is added, namely that the denotations of both formulas are 
distinct. The rules $(\equiv_{\neg})$, $(\equiv_{\to})$ and $(\equiv_{\equiv})$ are tableau-counterparts of the axioms 5, 
6, and 7, respectively. The rule (F) ensures that a valuation that can be read 
off from an open branch is a function, i.e., that all denotations assigned to the 
same formula on a branch are equal. The rules (sym) and (tran) guarantee that 
equalities appearing on a branch preserve all properties of the =-relation. Note 
that an application of a closure rule to a branch is always a result of transforma-
tions of equality statements. While executing $T_{\mathrm{SCI}}$ we always apply closure rules 
eagerly, that is, whenever a closure rule can be applied, it should be applied. 
An example of a tableau proof generated by $T_{\mathrm{SCI}}$ can be found in Figure 2. 
The tableau system T_SCI is a user-friendly and elegant solution to the problem
most non-labelled systems for SCI struggle with, namely substitutability of
identical formulas within other formulas with identity preservation. In a
derivation, such substitution can yield conclusions of greater complexity than the
premises, as shown at the end of Section 2, and this often leads to a loss of the
subformula property in a deduction system. T_SCI, on the other hand, reduces the
whole reasoning to a simple equality calculus where only identities or
non-identities between labels are substantial for the result of a given derivation.
It allows us to circumvent the abovementioned problem by replacing it with a
question: are labels representing given formulas equal or distinct?

Fig. 2. Tableau proof for the axiom φ ≡ ψ → (φ → ψ) (tree diagram not reproduced)


3.2 Soundness and Completeness⁴


First, we will prove soundness of the tableau system T_SCI.


⁴ A technical appendix to the paper with all omitted proofs can be found in [12].

Let A, B be finite sets such that A ⊆ LF and B ⊆ Id. A set A ∪ B is said to
be satisfied in an SCI-model M = (U, D, ¬̃, →̃, ≡̃) by a valuation V in M and a
function f : L → U if and only if the following hold: (1) V(φ) = f(w), for all
w ∈ L and φ ∈ FOR such that w : φ ∈ A, (2) f(w) ∈ D iff w ∈ L⁺, for all labels
w that occur in A ∪ B, (3) f(w) = f(v), for all w, v ∈ L such that w = v ∈ B,
(4) f(w) ≠ f(v), for all w, v ∈ L such that w ≠ v ∈ B. A set A ∪ B is said
to be SCI-satisfiable whenever there exist an SCI-model M = (U, D, ¬̃, →̃, ≡̃), a
valuation V in M, and a function f : L → U such that A ∪ B is satisfied in M
by V and f.


Proposition 1. For every satisfiable SCI-formula φ and for all w⁺ ∈ L⁺ it
holds that {w⁺ : φ} is SCI-satisfiable.


Proposition 2. For all w, v ∈ L, w⁺ ∈ L⁺, and v⁻ ∈ L⁻, and for all finite
X ⊆ LF ∪ Id, the sets X ∪ {w = v, w ≠ v} and X ∪ {w⁺ = v⁻} are not
SCI-satisfiable.


Let (R) Φ / Ψ_1 | ... | Ψ_n, for n ≥ 1, be a decomposition or equality rule of the
tableau system T_SCI. A rule (R) is referred to as sound whenever, for every finite
set X ⊆ LF ∪ Id, it holds that X ∪ Φ is SCI-satisfiable iff X ∪ Φ ∪ Ψ_i is
SCI-satisfiable for some i ∈ {1, ..., n}.


Proposition 3. Decomposition and equality rules of the tableau system T_SCI are
sound.


Theorem 1 (Soundness). The tableau system T_SCI is sound, that is, if an
SCI-formula φ is satisfiable, then φ is tableau-satisfiable.


Proof. We prove the contrapositive. Let T be a closed T_SCI-tableau with w⁺ : φ
at its root. Then, each branch of T contains either w⁺ = v⁻ or both w = v
and w ≠ v, for some w, v ∈ L, w⁺ ∈ L⁺, v⁻ ∈ L⁻. By Proposition 2, both
sets X ∪ {w⁺ = v⁻} and X ∪ {w = v, w ≠ v} are not SCI-satisfiable, for any
finite set X ⊆ LF ∪ Id. By Proposition 3, each application of T_SCI-rules preserves
SCI-satisfiability. Hence, going from the bottom to the top of the tree T, at each
step of the construction of the T_SCI-tableau we get SCI-unsatisfiable sets. Thus,
we can conclude that w⁺ : φ is not SCI-satisfiable, and thus by Proposition 1 we
obtain that φ is not SCI-satisfiable. Therefore, each satisfiable SCI-formula φ is
tableau-satisfiable.


To prove completeness of the system T_SCI we need to show that if, for a given
formula φ, T_SCI does not yield a tableau proof, then φ is not valid, i.e., there
exists a countermodel M = (U, D, ¬̃, →̃, ≡̃) such that M ⊭ φ.

Suppose that we want to obtain a tableau proof for a formula φ. To that
end, we run the T_SCI-tableau algorithm with a labelled formula w⁻ : φ at the
root of the tableau, for w⁻ ∈ L⁻. Suppose that it yields an open tableau as a
result. It means that the tableau contains an open and fully expanded branch
B. We will demonstrate how to construct a structure M_B = (U, D, ¬̃, →̃, ≡̃)
using information stored on B and show that it actually is an SCI-countermodel
falsifying φ. Let L⁺_B be the set of all labels superscribed with ‘+’ occurring on
B, let L⁻_B be the set of all labels superscribed with ‘−’ occurring on B and
let L_B = L⁺_B ∪ L⁻_B. Moreover, let FOR_B be the set of all SCI-formulas ψ such
that w : ψ occurs on B, for some w ∈ L_B. Note that all elements of FOR_B are
subformulas of φ. Before we characterize the construction of M_B, we define a
binary relation ∼ ⊆ L_B × L_B in the following way:


    w ∼ v iff w = v occurs on B.

Proposition 4. The relation ∼ is an equivalence relation and (L⁺_B × L⁻_B) ∩ ∼ = ∅.
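
The role of the relation ∼ and of the two closure conditions can be illustrated with a small sketch. It is not taken from the authors' Haskell provers: the label encoding (strings ending in "+" or "-"), the class `LabelClasses` and the function `branch_is_closed` are assumptions made only for this example. It computes the symmetric-transitive closure of the equalities on a branch with a union-find structure and then checks whether some class mixes polarities or some recorded inequality relates two equal labels.

```python
# Illustrative sketch only; names and label encoding are hypothetical.
# A label is a string whose last character is its polarity: "+" or "-".

class LabelClasses:
    """Union-find over labels; each class mimics an equivalence class of ~."""

    def __init__(self):
        self.parent = {}

    def find(self, w):
        self.parent.setdefault(w, w)
        while self.parent[w] != w:
            self.parent[w] = self.parent[self.parent[w]]  # path halving
            w = self.parent[w]
        return w

    def union(self, w, v):
        self.parent[self.find(w)] = self.find(v)


def branch_is_closed(equalities, inequalities):
    """Return True if the equality statements force a contradiction."""
    classes = LabelClasses()
    for w, v in equalities:
        classes.union(w, v)
    # First closure condition: a positive label equal to a negative one.
    sign_of_class = {}
    for w in list(classes.parent):
        root = classes.find(w)
        if sign_of_class.setdefault(root, w[-1]) != w[-1]:
            return True
    # Second closure condition: w = v together with w != v.
    return any(classes.find(w) == classes.find(v) for w, v in inequalities)
```

On an open branch this check returns False, which corresponds to the second part of Proposition 4: no class of ∼ contains both a positive and a negative label.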


Let ML⁺_B be a set resulting from choosing exactly one label from each element
of (L⁺_B)/∼. Sets ML⁻_B and ML_B are defined analogically with the assumption that
𝐰⁻ ∈ ML⁻_B, where 𝐰⁻ is such that 𝐰⁻ : φ is at the root of an open tableau. Of
course, neither of these sets is uniquely determined.


Proposition 5. For all ψ ∈ FOR and w, v ∈ L_B the following holds:
if both w : ψ and v : ψ belong to B, then w ∼ v.


We say that w ∈ ML_B is (¬)-closed whenever there are ψ ∈ FOR, u ∈ ML_B,
and v, t ∈ L_B such that w ∼ v, u ∼ t and labelled formulas v : ψ, t : ¬ψ belong
to B. Let w, v ∈ ML_B and # ∈ {→, ≡}. The pair (w, v) is said to be (#)-closed
whenever there exist ψ, θ ∈ FOR, u ∈ ML_B, and t, x, y ∈ L_B such that w ∼ t,
v ∼ x, u ∼ y and labelled formulas t : ψ, x : θ, y : (ψ # θ) occur on the branch B.


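The branch structure M_B = (U, D, ¬̃, →̃, ≡̃) is defined as follows:


- D = {w⁺ | w⁺ ∈ ML⁺_B} ∪ {𝐰⁺}, where 𝐰⁺ ∉ L_B
- U = D ∪ ML⁻_B


It follows from the above that U \ D = ML⁻_B. The operations ¬̃, →̃, ≡̃ are defined
for all w, v ∈ U in the following way:


\[
\tilde{\neg}\, w \overset{\mathrm{df}}{=}
\begin{cases}
u \in U, & \text{if there are } \psi \in \mathrm{FOR} \text{ and } v, t \in L_B \text{ such that } w \sim v,\ u \sim t, \\
         & \text{and } v : \psi \text{ and } t : \neg\psi \text{ are on } B \\
\mathbf{w}^{+}, & \text{if } w \text{ is not } (\neg)\text{-closed and } w \notin D \\
\mathbf{w}^{-}, & \text{otherwise}
\end{cases}
\]

\[
w \mathbin{\tilde{\to}} v \overset{\mathrm{df}}{=}
\begin{cases}
u \in U, & \text{if there are } \psi, \theta \in \mathrm{FOR} \text{ and } t, x, y \in L_B \text{ such that } w \sim t, \\
         & v \sim x,\ u \sim y, \text{ and } t : \psi,\ x : \theta, \text{ and } y : (\psi \to \theta) \text{ are on } B \\
\mathbf{w}^{+}, & \text{if } v = \mathbf{w}^{+} \text{ or both } (w = \mathbf{w}^{+} \text{ and } v \in D), \text{ or it holds that } \\
         & (w, v) \text{ is not } (\to)\text{-closed and either } w \notin D \text{ or } v \in D \\
\mathbf{w}^{-}, & \text{otherwise}
\end{cases}
\]

\[
w \mathbin{\tilde{\equiv}} v \overset{\mathrm{df}}{=}
\begin{cases}
u \in U, & \text{if there are } \psi, \theta \in \mathrm{FOR} \text{ and } t, x, y \in L_B \text{ such that } w \sim t, \\
         & v \sim x,\ u \sim y, \text{ and } t : \psi,\ x : \theta, \text{ and } y : (\psi \equiv \theta) \text{ are on } B \\
\mathbf{w}^{+}, & \text{if } w = v \text{ and either } w = \mathbf{w}^{+} \text{ or the pair } (w, v) \text{ is not } (\equiv)\text{-closed} \\
\mathbf{w}^{-}, & \text{otherwise}
\end{cases}
\]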


Due to the properties of the sets ML⁺_B and ML⁻_B, we obtain:

Proposition 6. The sets D and U \ D are non-empty and D ∩ (U \ D) = ∅.


The following series of results ensures that the operations ¬̃, →̃, and ≡̃ reflect
the semantics of SCI.


Proposition 7. ¬̃ is a function on U and for all w ∈ U:
(∗) ¬̃w ∈ D iff w ∉ D.

Proposition 8. →̃ is a function on U and for all w, v ∈ U, the following holds:
(∗) w →̃ v ∈ D iff w ∉ D or v ∈ D.

Proposition 9. ≡̃ is a function on U and for all w, v ∈ U the following holds:
(∗) w ≡̃ v ∈ D iff w = v.

Propositions 6-9 imply:

Proposition 10. The structure M_B is an SCI-model.

In what follows, the structure M_B will be referred to as the branch model.
Now, let V : FOR → U be a function such that for all p ∈ AF:

\[
V(p) \overset{\mathrm{df}}{=}
\begin{cases}
u, & \text{if there is } w \in L_B \text{ such that } w : p \in B \text{ and } w \sim u \\
\mathbf{w}^{+}, & \text{otherwise}
\end{cases}
\]

and for all ψ, θ ∈ FOR the following hold:

    V(¬ψ) = ¬̃V(ψ)
    V(ψ # θ) = V(ψ) #̃ V(θ), for # ∈ {→, ≡}.


Proposition 11. The function V is well defined and it is a valuation in M_B.

Proposition 12. For all ψ ∈ FOR and w ∈ L_B it holds that:
(∗) if w : ψ ∈ B, then w ∼ V(ψ).


Theorem 2 (Completeness). The tableau system T_SCI is complete, that is, if
a formula φ is SCI-valid, then φ has a tableau proof.


Proof. Let φ be a valid SCI-formula. Suppose that φ does not have a tableau
proof. Then, each T_SCI-tableau with w⁻ : φ at its root is open. Let B be an open
and fully expanded branch of an open tableau for w⁻ : φ. By Proposition 10,
the structure M_B = (U, D, ¬̃, →̃, ≡̃) is an SCI-model. Let V be a valuation in
M_B defined as before Proposition 11. Then, by Proposition 12, w⁻ ∼ V(φ), and
hence V(φ) ∉ D. Thus, φ is not true in M_B, which contradicts the assumption
that φ is SCI-valid.


3.3 Termination


It turns out that the system presented in Section 3.1 terminates without any 
external blocking mechanisms involved which would impose some additional re- 
strictions on rule-application. The only caveat that has to be added to the system 
is the one that we have already expressed, namely that no rule (R) can be applied 
to the set of premises that is (R)-expanded. 


Theorem 3. The tableau system T_SCI is terminating.


Proof. The argument hinges on two observations. First, the decomposition rules
are the only rules that introduce fresh labels to a branch B of a T_SCI-tableau
T, and, as mentioned before, on a branch B each occurrence of a subformula of
the initial formula φ is assigned its unique label. Thus, since an application of
any of the above rules decreases the complexity of the processed formula and
the rule cannot be applied twice to the same premise, the total number of labels
occurring on a branch does not exceed the size of φ measured as the number of
all occurrences of subformulas of φ (henceforth denoted by |φ|). Secondly, the
equality rules can only add equalities between labels to a branch, provided that
such an equality statement is not already present thereon. The maximal number
of such equalities is quadratic in the total number of labels occurring on a
branch. Thus, for each SCI-formula φ, on any branch B of a T_SCI-tableau for φ,
rules are applied at most |φ| + |φ|² + 1 times, where ‘1’ in the formula represents
an application of a closure rule. This makes the whole derivation finite.
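
To make the bound from the proof concrete, the following sketch counts subformula occurrences and evaluates |φ| + |φ|² + 1. The tuple encoding of formulas and the names `size` and `rule_application_bound` are assumptions introduced for this illustration only; they do not come from the paper or its implementation.

```python
# Illustrative sketch; the formula encoding below is an assumption:
#   atom            -> a string such as "p"
#   negation        -> ("not", f)
#   implication     -> ("imp", f, g)
#   identity        -> ("id", f, g)

def size(formula):
    """Number of occurrences of subformulas, i.e. |phi| in the proof above."""
    if isinstance(formula, str):
        return 1
    return 1 + sum(size(sub) for sub in formula[1:])


def rule_application_bound(formula):
    """Upper bound |phi| + |phi|^2 + 1 on rule applications per branch."""
    n = size(formula)
    return n + n * n + 1


# (p ≡ q) → (p → q) has 7 subformula occurrences, so the bound is 7 + 49 + 1.
example = ("imp", ("id", "p", "q"), ("imp", "p", "q"))
print(size(example), rule_application_bound(example))
```

The bound is a worst-case estimate per branch; it says nothing about the number of branches, which is why the tableaux reported in Section 4 can still be large.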


Corollary 1. For each SCI-formula φ every branch B of a T_SCI-tableau derivation
for φ is of polynomial size with respect to the size of φ.


Since SCI contains classical propositional logic, it inherits the NP-lower bound
for the satisfiability problem therefrom. Together with membership of
SCI-satisfiability in NP it gives the following:


Theorem 4. T_SCI is a complexity-optimal decision procedure for the NP-complete
problem of SCI-satisfiability.


Proof. Immediate from Corollary 1 and the fact that each branching rule of T_SCI
is finitely branching.


3.4 Limiting the Number of Labels 


To boost the performance of the system T_SCI we propose a refinement thereof. It
consists in limiting the number of fresh labels introduced to a tableau by
decomposition rules by introducing an additional condition called urfather
blocking.

Given a formula φ for which we construct a T_SCI-tableau T, for each subformula
ψ of φ, let us call the first occurrence of a labelled formula w : ψ on a
branch B of T the ψ-urfather on B. The system T_SCI + (UB) (tableau system for
SCI with urfather blocking) is composed of the rules of T_SCI and an additional
constraint:

(UB) For each labelled formula w : ψ that occurs on a branch B, no decomposition
rule can be applied to w : ψ unless it is the ψ-urfather on B.
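
The blocking condition can be read as a simple filter over a branch. The sketch below assumes, purely for illustration, that a branch is a list of (label, formula) pairs in the order in which they were added and that formulas are hashable values; the function name `urfathers` is hypothetical and the authors' prover may organize this differently.

```python
# Illustrative sketch; the branch representation is an assumption.

def urfathers(branch):
    """Return the labelled formulas that decomposition rules may still process.

    For every formula only its first occurrence on the branch (its urfather)
    is kept; later occurrences of the same formula are blocked.
    """
    seen = set()
    allowed = []
    for label, formula in branch:
        if formula not in seen:
            seen.add(formula)
            allowed.append((label, formula))
    return allowed


branch = [("w1-", ("not", "p")), ("w2+", "p"), ("w3+", "p"), ("w4-", ("not", "p"))]
print(urfathers(branch))
```

In this example only the first two entries remain eligible for decomposition; the occurrences labelled w3+ and w4- are blocked.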


It turns out that augmenting T_SCI with (UB) does not lead to any unwanted
consequences such as giving up completeness.


Proposition 13. For every SCI-formula φ, if φ has a T_SCI-tableau proof, then
φ has a T_SCI + (UB)-tableau proof.


Theorem 5. T_SCI + (UB) is sound, complete, terminating, and complexity-optimal
for SCI-satisfiability.


Proof. The soundness of T_SCI + (UB) straightforwardly follows from the soundness
of T_SCI and the fact that both systems share the full set of rules. The argument
for termination of T_SCI + (UB) and complexity-optimality of T_SCI + (UB)
for SCI-satisfiability goes along the same lines as the proofs of Theorems 3 and 4,
and rests on the fact that, for each formula φ, a T_SCI + (UB)-tableau contains
at most as many labels as a T_SCI-tableau. The completeness of T_SCI + (UB) is a
direct consequence of Proposition 13 and Theorem 2.


4 Implementation 


4.1 Overview 


We have written proof-of-concept type implementations of the labelled tableau 
system described in the present article and its variant with urfather blocking, as 
well as a dual-tableau-based theorem prover for SCI based on the system from [5]. 
Since the last system does not enjoy the termination property, the implemen- 
tation relies on heuristics in this respect. All three provers are implemented in 
the Haskell language using similar programming techniques in a casual manner, 
without any serious attempt to optimize the code or to test it extensively, as the 
programs are only intended as temporary aids to ongoing research. 

In testing, the labelled-tableau provers turned out to need drastically more
computing resources than the dual-tableau prover, even in many quite modest test
cases. For instance, the
axiom ((p ≡ q) ∧ (r ≡ s)) → ((p ≡ r) ≡ (q ≡ s)) generates a labelled tableau of
depth 37 consisting of 619 nodes, which urfather blocking reduces to depth 33
and 555 nodes, while the tree of the dual-tableau prover has depth 18 and only
67 nodes. The difference appears to be mostly due to the large branching factor of
the identity rules of the labelled-tableau system. However, in some test cases the
labelled-tableau system yields a smaller tree than the other prover. In general,
the labelled tableau method seems to tolerate relatively well formulas consisting
of a large number of very simple identities.


4.2 Technical Notes 


Unlike the abstract tree described above, each node of which contains only a
single labelled formula, each node of the tree built by the program contains a
list of all the labelled formulas encountered so far on the branch. This allows the
program to freely manipulate the list to keep track of what rules have already
been applied to which formulas. There are three main types of nodes: normal
nodes, identity nodes, and leaves. First, the decomposition rules are applied in
normal nodes. Once they have been applied to exhaustion, the tree is extended
with identity nodes, in which the identity rules are applied. At any point, one
of the closure rules (⊥1) or (⊥2) can be applied to append a special closure leaf
node. An open leaf node is appended whenever there are no more rules to apply
in an identity node and the branch remains open.


4.3 Test Results 


We found a randomly generated provable SCI-formula that turned out to be
somewhat challenging to an earlier prover. The formula, which we will call
φ here, looks as follows:


(((q =p) > (p > r)) = ((p > (p © p)) =p) 


> (((rA p) © (p = p)) V ((p ^ p) V 7q)) 


We denote by ψ the formula obtained by replacing each occurrence of p in φ
by φ itself. We defined a provability-preserving transformation T that turns an
SCI-formula into a Horn clause consisting of very simple identities.

We present the results of attempting to prove the formulas φ, ¬φ, ψ, ¬ψ,
T(φ), and T(¬φ). These are chosen to illustrate some of the variety of outcomes
we observed. As noted above, φ is provable, and therefore also ψ and T(φ) are
provable. The results are of the form depth/size, where depth is the maximal
branch length and size is the number of nodes in the entire tree. There are
entries for the dual-tableau-based prover (DT_SCI), the current labelled-tableau
prover (T_SCI), and the same with the urfather blocking condition (T_SCI + (UB)).
Several entries are missing due to exhaustion of memory (the programs were
tested on a machine with 8 GB of RAM; adding several gigabytes of swap space
did not make a difference).


Formula    DT_SCI            T_SCI             T_SCI + (UB)
           depth   size      depth   size      depth   size
φ          27      299       37      4724      32      4659
¬φ         12      42        202     111539    106     95724
ψ          61      17729     -       -         46      3023804
¬ψ         42      602       -       -         -       -
T(φ)       -       -         143     40230     106     34158
T(¬φ)      -       -         529     52789     490     46153

5 Conclusions 


In this paper we introduced the system T_SCI which is the first complexity-optimal
decision procedure for the logic SCI devised in the paradigm of labelled tableaux.
T_SCI is conceptually simple and directly reflects the semantics of the logic. The
reasoning performed in T_SCI has two components: decomposition and equality
reasoning. Interestingly, it is the latter that is responsible for closing tableau
branches, and thus, yielding tableau proofs for formulas. In this respect T_SCI
is based on similar conceptual foundations as calculi generated by the
tableau-synthesis framework from [20]. We provided an implementation of T_SCI and a
variant with urfather blocking, and we compared their performance with the
performance of another implemented deduction system for SCI which has not
been proven to be terminating or complete. There was no unique winner; the new
system was better at dealing with formulas with complex networks of identities,
while the old, unproven system handled other types of formulas better. Urfather
blocking yielded modest reductions in depth and total size.

In future research we want to address three main problems. First, we would
like to optimize our tableau algorithm by introducing further refinements to it,
such as decreasing the branching factor of the rule (→⁺) and, by that means,
making it “information-deleting”. Some preliminary results on the implementation
of T_SCI with the modified rule (→⁺) show a promising reduction of the
size of generated tableaux. Moreover, we plan to search for heuristics and
rule-application strategies which would also allow us to minimize the size of
tableaux yielded by T_SCI for certain classes of formulas. It seems that it is not
always necessary to fully decompose the input formula before performing any equality
reasoning, if a contradiction is to be reached on a branch. Secondly, we would like
to develop the dual-tableau systems from [5] and [10] to full-fledged decision
procedures, implement them, and compare the performance of all three algorithms
on an extensive set of various SCI-formulas. Thirdly, we intend to extend the
labelled tableaux-based approach presented in this paper to other non-Fregean
logics, both classical (such as modal non-Fregean logics) and deviant (such as
intuitionistic or many-valued non-Fregean logics, or Grzegorczyk’s logic). Finally,
we would like to take a closer look at various normal forms of SCI-formulas, one
of which was mentioned in Section 4, and decide in what cases it pays off to
transform a formula into a normal form before running a decision procedure,
rather than running it directly on the initial formula.


Abstract. The material presented in this paper contributes to establishing
a basis deemed essential for substantial progress in Automated
Deduction. It identifies and studies global features in selected problems
and their proofs which offer the potential of guiding proof search in a
more direct way. The studied problems are of the wide-spread form of
“axiom(s) and rule(s) imply goal(s)”. The features include the well-known
concept of lemmas. For their elaboration both human and automated
proofs of selected theorems are taken into a close comparative consideration.
The study at the same time accounts for a coherent and comprehensive
formal reconstruction of historical work by Łukasiewicz, Meredith
and others. First experiments resulting from the study indicate novel
ways of lemma generation to supplement automated first-order provers
of various families, strengthening in particular their ability to find short
proofs.


1 Introduction 


Research in Automated Deduction, also known as Automated Theorem Proving 
(ATP), has resulted in systems with a remarkable performance. Yet, deep math- 
ematical theorems or otherwise complex statements still withstand any of the 
systems’ attempts to find a proof. The present paper is motivated by the thesis 
that the reason for the failure in more complex problems lies in the local orient- 
edness of all our current methods for proof search like resolution or connection 
calculi in use. 

In order to find out more global features for directing proof search we start
out here to study the structures of proofs for complex formulas in some detail
and compare human proofs with those generated by systems. Complex formulas
of this kind have been considered by Łukasiewicz in [19]. They are complex in the
sense that current systems require tens of thousands or even millions of search
steps for finding a proof if any, although the length of the formulas is very short
indeed. How come that Łukasiewicz found proofs for those formulas although
he could never carry out more than, say, a few hundred search steps by hand?
Which global strategies guided him in finding those proofs? Could we discover
such strategies from the formulas’ global features?

By studying the proofs in detail we hope to come closer to answers to those
questions. Thus it is proofs, rather than just formulas or clauses as usually in
ATP, which is in the focus of our study. In a sense we are aiming at an
ATP-oriented part of Proof Theory, a discipline usually pursued in Logic yet under
quite different aspects. This meta-level perspective has rarely been taken in ATP
for which reason we cannot rely on the existing conceptual basis of ATP but have
to build an extensive conceptual basis for such a study more or less from scratch.


© The Author(s) 2021
A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 58-75, 2021.
https://doi.org/10.1007/978-3-030-79876-5_4


Investigations into Proof Structures

This investigation thus analyzes structures of, and operations on, proofs for
formulas of the form “axiom(s) and rule(s) imply goal(s)”. It renders condensed
detachment, a logical rule historically introduced in the course of studying these
complex proofs, as a restricted form of the Connection Method (CM) in ATP. All
this is pursued with the goal of enhancing proof search in ATP in mind. As noted,
our investigations are guided by a close inspection into proofs by Łukasiewicz
and Meredith. In fact, the work presented here amounts at the same time to a
very detailed reconstruction of those historical proofs.

The rest of the paper is organized as follows: In Sect. 2 we introduce the
problem and a formal human proof that guides our investigations and compare
different views on proof structures. We then reconstruct in Sect. 3 the historical
method of condensed detachment in a novel way as a restricted variation of the
CM where proof structures are represented as terms. This is followed in Sect. 4 by
results on reducing the size of such proof terms for application in proof shortening
and restricting the proof search space. Section 5 presents a detailed feature table
for the investigated human proof, and Sect. 6 shows first experiments where the
features and new techniques are used to supplement the inputs of ATP systems
with lemmas. Section 7 concludes the paper. Supplementary technical material
including proofs is provided in the report [37]. Data and tools to reproduce the
experiments are available at http://cs.christophwernhard.com/cd.


2 Relating Formal Human Proofs with ATP Proofs 


In 1948 Jan Łukasiewicz published a formal proof of the completeness of his
shortest single axiom for the implicational fragment (IF), that is, classical
propositional logic with implication as the only logic operator [19]. In his
notation the implication p → q is written as Cpq. Following Frank Pfenning [27] we
formalize IF on the meta-level in the first-order setting of modern ATP with a single
unary predicate P to be interpreted as something like “provable” and represent
the propositional formulas by terms using the binary function symbol i for
implication. We will be concerned with the following formulas.


Nickname [28][29, p. 319]   Łukasiewicz’s notation   First-order representation

Simp                        CpCqp                    ∀pq P(i(p, iqp))
Peirce                      CCCpqpp                  ∀pq P(i(i(ipq, p), p))
Syll                        CCpqCCqrCpr              ∀pqr P(i(ipq, i(iqr, ipr)))
Syll Simp                   CCCpqrCqr                ∀pqr P(i(i(ipq, r), iqr))
Łukasiewicz                 CCCpqrCCrpCsp            ∀pqrs P(i(i(ipq, r), i(irp, isp)))
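
The correspondence between the two right-hand columns of the table is mechanical. As a small illustration, the following sketch reads a formula in Łukasiewicz's prefix notation and prints the corresponding term built with the function symbol i. The function names `parse_polish` and `show`, the nested-tuple term encoding, and the fully parenthesized output form `i(p,q)` (instead of the abbreviated `ipq` used in the table) are choices made for this example only.

```python
# Illustrative sketch; term encoding and function names are assumptions.
# A variable is a one-letter string; an implication is ("i", left, right).

def parse_polish(text):
    """Parse a formula such as 'CpCqp', where C is the only connective."""

    def parse_at(pos):
        if pos >= len(text):
            raise ValueError("unexpected end of formula")
        if text[pos] == "C":
            left, pos = parse_at(pos + 1)
            right, pos = parse_at(pos)
            return ("i", left, right), pos
        return text[pos], pos + 1

    term, end = parse_at(0)
    if end != len(text):
        raise ValueError("trailing symbols after a complete formula")
    return term


def show(term):
    """Render a term with explicit parentheses, e.g. i(p,i(q,p))."""
    if isinstance(term, str):
        return term
    return f"i({show(term[1])},{show(term[2])})"


for name, formula in [("Simp", "CpCqp"), ("Łukasiewicz", "CCCpqrCCrpCsp")]:
    print(name, show(parse_polish(formula)))
```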


IF can be axiomatized by the set of the three axioms Simp, Peirce and Syll,
known as Tarski-Bernays Axioms. Alfred Tarski in 1925 raised the problem to
characterize IF by a single axiom and solved it with very long axioms, which led
to a search for the shortest single axiom, which was found with the axiom
nicknamed after him in 1936 by Łukasiewicz [19]. In 1948 he published his derivation
that Łukasiewicz entails the three Tarski-Bernays Axioms, expressed formally by
the method of substitution and detachment. Detachment is also familiar as modus
ponens. Łukasiewicz’s proof involves 34 applications of detachment. Among the
Tarski-Bernays axioms Syll is by far the most challenging to prove, hence his
proof centers around the proof of Syll, with Peirce and Simp spinning off as
side results. Carew A. Meredith presented in [24] a “very slight abridgement” of
Łukasiewicz’s proof, expressed in his framework of condensed detachment [28],
where the performed substitutions are no longer explicitly presented but
implicitly assumed through unification. Meredith’s proof involves only 33 applications
of detachment. In our first-order setting, detachment can be modeled with the
following meta-level axiom.


    P(i(i(ipq, r), i(irp, isp))) ∧ (Px ∧ Pixy → Py) → P(i(ipq, i(iqr, ipr)))

Fig. 1. EDS along with its five unifiable connections (connection arcs, numbered 1 to 5, not reproduced).


Det ≝ ∀xy (Px ∧ Pixy → Py).


In Det the atom Px is called the minor premise, Pixy the major premise, and
Py the conclusion. Let us now focus on the following particular formula.


EDS ≝ Łukasiewicz ∧ Det → Syll.


“Problem EDS” is then the problem of determining the validity of the first-order
formula EDS. In view of the CM [1,2,3], a formula is valid if there is a spanning
and complementary set of connections in it. In Fig. 1 EDS is presented again,
nicknames dereferenced and quantifiers omitted as usual in ATP, with the five
unifiable connections in it. Observe that p, q, r, s on the left side of the main
implication are variables, while p, q, r on the right side are Skolem constants. Any
CM proof of EDS consists of a number of instances of the five shown connections.
Meredith’s proof, for example, corresponds to 491 instances of Det, each linked
with three instances of its five incident connections.

Figure 2 compares different representations of a short formal proof with
the Det meta axiom. There is a single axiom, Syll Simp, and the theorem
is ∀pqrstu P(i(p, i(q, i(r, i(s, i(t, ius)))))). Figure 2a shows the structure of a CM
proof. It involves seven instances of Det, shown in columns D_1, ..., D_7. The
major premise Pix_iy_i is displayed there on top of the minor premise Px_i, and
the (negated) conclusion ¬Py_i, where x_i, y_i are variables. Instances of the
axiom appear as literals ¬Pa_i, with a_i a shorthand for the term
i(i(ip_iq_i, r_i), iq_ir_i). The rightmost literal Pg is a shorthand for the
Skolemized theorem. The clause instances are linked through edges representing
connection instances. The edge labels identify the respective connections as in
Fig. 1. An actual connection proof is obtained by supplementing this structure with
a substitution under which all pairs of literals related through a connection
instance become complementary.


(c)   1. CCCpqrCqr
      2. CpCqp = D11
      3. CpCqCrp = D12
    * 4. CpCqCrCsCtCus = D2D33

Fig. 2. A proof in different representations. (Panels (a), (b), (d) and (e) are
diagrams and are not reproduced; panel (c) is shown above.)


Figure 2b represents the tree implicit in the CM proof. Its inner nodes corre- 
spond to the instances of Det, and its leaf nodes to the instances of the axiom. 
Edges appear ordered to the effect that those originating in a major premise of 
Det are directed to the left and those from a minor premise to the right. The 
goal clause Pg is dropped. The resulting tree is a full binary tree, i.e., a binary 
tree where each node has 0 or 2 children. We observe that the ordering of the 
children makes the connection labeling redundant as it directly corresponds to 
the tree structure. 


Figure 2c presents the proof in Meredith’s notation. Each line shows a formula,
line 1 the axiom and lines 2-4 derived formulas, with proofs annotated in
the last column. Proofs are written as terms in Polish notation with the binary
function symbol D for detachment where the subproofs of the major and minor
premise are supplied as first and second, resp., argument. Formula 4, for example,
is obtained as conclusion of Det applied to formula 2 as major premise and
as minor premise another formula that is not made explicit in the presentation,
namely the conclusion of Det applied to formula 3 as both, major and minor,
premises. An asterisk marks the goal theorem.

Figure 2d is like Fig. 2b, but with a different labeling: Node labels now refer
to the line in Fig. 2c that corresponds to the subproof rooted at the node. The
blank node represents the mentioned subproof of the formula that is not made
explicit in Fig. 2b. An inner node represents a condensed detachment step applied
to the subproof of the major premise (left child) and minor premise (right child).

Figure 2e shows a DAG (directed acyclic graph) representation of Figure 2d. It
is the unique maximally factored DAG representation of the tree, i.e., it has
no multiple occurrences of the same subtree. Each of the four
proof line labels of Fig. 2c appears exactly once in the DAG.
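
The difference between the tree view (Fig. 2b, 2d) and the DAG view (Fig. 2e) can be made tangible with a short sketch. It encodes the proof term of line 4 of Fig. 2c as nested tuples and counts, on the one hand, all detachment nodes of the tree and, on the other hand, the distinct compound subterms that remain after maximal sharing. The tuple encoding and the function names `tree_size` and `compacted_size` are assumptions made for this example, and the two measures are used here only informally; the formal definitions are those given in Sect. 3.1.

```python
# Illustrative sketch; encoding and measure names are assumptions.
# The axiom is the integer 1; a detachment step is ("D", major, minor).

def tree_size(term):
    """Number of D nodes in the proof tree (occurrences counted separately)."""
    if not isinstance(term, tuple):
        return 0
    return 1 + tree_size(term[1]) + tree_size(term[2])


def compacted_size(term):
    """Number of distinct compound subterms, i.e. D nodes of the shared DAG."""
    seen = set()

    def visit(t):
        if isinstance(t, tuple) and t not in seen:
            seen.add(t)
            visit(t[1])
            visit(t[2])

    visit(term)
    return len(seen)


line2 = ("D", 1, 1)                           # 2. = D11
line3 = ("D", 1, line2)                       # 3. = D12
line4 = ("D", line2, ("D", line3, line3))     # 4. = D2D33

print(tree_size(line4), compacted_size(line4))
```

The tree has seven detachment nodes, matching the seven instances of Det in Fig. 2a, whereas the shared representation has four: lines 2, 3 and 4 plus the unnamed intermediate subproof.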

We conclude this introductory section with reproducing Meredith’s refinement
of Łukasiewicz’s completeness proof in Fig. 3, taken from [24]. Since we
will often refer to this proof, we call it MER. There is a single axiom (1), which is
Łukasiewicz. The proven theorems are Syll (17), Peirce (18) and Simp (19). In
addition to line numbers also the symbol n appears in some of the proof terms.
Its meaning will be explained later on in the context of Def. 19. For now, we can
read n just as “1”. Dots are used in the Polish notation to disambiguate numeric
identifiers with more than a single digit.


   1. CCCpqrCCrpCsp
   2. CCCpqpCrp = DDD1D111n
   3. CCCpqrCqr = DDD1D1D121n
   4. CpCCpqCrq = D31
   5. CCCpqCrsCCCqtsCrs = DDD1D1D1D141n
   6. CCCpqCrsCCpsCrs = D51
   7. CCpCqrCCCpsrCqr = D64
   8. CCCCCpqrtCspCCrpCsp = D71
   9. CCpqCpq = D83
  10. CCCCrpCtpCCCpqrsCuCCCpqrs = D18
  11. CCCCpqrCsqCCCqtsCpq = DD10.10.n
  12. CCCCpqrCsqCCCqtpCsq = D5.11
  13. CCCCpqrsCCsqCpq = D12.6
  14. CCCpqrCCrpp = D12.9
  15. CpCCpqq = D3.14
  16. CCpqCCCprqq = D6.15
 *17. CCpqCCqrCpr = DD13.D16.16.13
 *18. CCCpqpp = D14.9
 *19. CpCqp = D33


Fig. 3. Proof MER, Meredith’s refinement [24] of
Łukasiewicz’s proof [19].


3 Condensed Detachment and a Formal Basis


Following [4], the idea of condensed detachment can be described as follows:
Given premises F → G and H, we can conclude G′, where G′ is the most general
result that can be obtained by using a substitution instance H′ as minor premise
with the substitution instance F′ → G′ as major premise in modus ponens.
Condensed detachment was introduced by Meredith in the mid-1950s as an
evolution of the earlier method of substitution and detachment, where the involved
substitutions were explicitly given. The original presentations of condensed
detachment are informal by means of examples [28,17,29,25]; formal specifications
have been given later [16,13,4]. In ATP, the rendering of condensed detachment
by hyperresolution with the clausal form of axiom Det is so far the prevalent
view. As overviewed in [23,31], many of the early successes of ATP were based
on condensed detachment. Starting from the hyperresolution view, structural
aspects of condensed detachment have been considered by Robert Veroff [34] with
the use of term representations of proofs and linked resolution. Results of ATP
systems on deriving the Tarski-Bernays axioms from Łukasiewicz are reported
in [27,39,22,23,11]. Our goal in this section is to provide a formal framework
that makes the achievements of condensed detachment accessible from a modern
ATP view. In particular, this concerns the incorporation of unification, the
interplay of nested structures with explicitly and implicitly associated formulas,
sharing of structures through lemmas, and the availability of proof structures as
terms.
Our notation follows common practice [6] (e.g., $s \geq t$ expresses that $t$
subsumes $s$, and $s \unrhd t$ that $t$ is a subterm of $s$) with some
additions [37]. For formulas $F$ we write the universal closure as $\forall F$,
and for terms $s, t, u$ we use $s[t \mapsto u]$ to denote $s$ after
simultaneously replacing all occurrences of $t$ with $u$.


3.1 Proof Structures: D-Terms, Tree Size and Compacted Size 


In this section we consider only the purely structural aspects of condensed
detachment proofs. The emphasis is on a twofold view of the proof structure, as
a tree and as a DAG (directed acyclic graph), which factorizes multiple
occurrences of the same subtree. Both representation forms are useful: the
compacted DAG form captures that lemmas can be used repeatedly in a proof,
whereas the tree form facilitates specifying properties inductively. Proofs are
represented as trees by terms built with the binary function symbol D; we call
these terms D-terms.


Definition 1. (i) We assume a distinguished set of symbols called primitive
D-terms. (ii) A D-term is inductively specified as follows: (1.) Any primitive
D-term is a D-term. (2.) If $d_1$ and $d_2$ are D-terms, then $D(d_1, d_2)$ is
a D-term. (iii) The set of primitive D-terms occurring in a D-term $d$ is
denoted by $\mathrm{Prim}(d)$. (iv) The set of all D-terms that are not
primitive is denoted by $\mathcal{D}$.


A D-term is a full binary tree (i.e., a binary tree in which every node has either 0
or 2 children), where the leaves are labeled with symbols, i.e., primitive D-terms.
An example D-term is 


d = D(D(1, 1), D(D(1, D(1, 1)), D(1, D(1, 1)))), (i) 


which represents the structure of the proof shown in Fig. 2 and can be visualized 
by the full binary tree of Fig. 2d after removing all labels with exception of the 
leaf labels. The proof annotations in Fig. 2c and Fig. 3 are D-terms written in 
Polish notation. The expression D2D33 in line 4 of Fig. 2, for example, stands 
for the D-term D(2, D(3,3)). Prim(D(2, D(3,3))) = {2,3}. 
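The Polish notation for D-terms can be read mechanically. The following Python sketch is only an illustration under our reading of the convention: without dots every digit is its own identifier, and with dots an identifier is a maximal run of digits. The names `parse_dterm` and `prim` and the nested-tuple representation are hypothetical and not taken from the paper.

```python
def parse_dterm(text):
    """Parse a Polish-notation D-term into nested tuples ("D", left, right)."""
    dotted = "." in text
    pos = 0

    def parse():
        nonlocal pos
        ch = text[pos]
        pos += 1
        if ch == "D":
            left = parse()
            right = parse()
            return ("D", left, right)
        if ch == "n":
            return "n"
        if not ch.isdigit():
            raise ValueError(f"unexpected character {ch!r}")
        start = pos - 1
        # Assumption: with dots in use, an identifier is a maximal digit run.
        while dotted and pos < len(text) and text[pos].isdigit():
            pos += 1
        label = text[start:pos]
        if pos < len(text) and text[pos] == ".":
            pos += 1  # skip the dot that terminates the identifier
        return label

    term = parse()
    if pos != len(text):
        raise ValueError("unexpected trailing input")
    return term


def prim(d):
    """Set of primitive D-terms (leaf labels) occurring in d."""
    if isinstance(d, tuple):
        return prim(d[1]) | prim(d[2])
    return {d}


example = parse_dterm("D2D33")   # the annotation discussed above
labels = prim(example)
```

Under these assumptions, `parse_dterm("D2D33")` yields the tuple form of D(2, D(3,3)), and `prim` returns its two leaf labels.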

A finite tree and, more generally, a finite set of finite trees can be represented
as a DAG, where each node in the DAG corresponds to a subtree of a tree in the
given set. It is well known that there is a unique minimal such DAG, which
is maximally factored (it has no multiple occurrences of the same subtree) or,
equivalently, is minimal with respect to the number of nodes, and, moreover,
can be computed in linear time [7]. The number of nodes of the minimal DAG
is the number of distinct subtrees of the members of the set of trees. There are
two useful notions of measuring the size of a D-term, based directly on its tree
representation and based on its minimal DAG, respectively.
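As an illustration of the minimal DAG, the following Python sketch shares equal subtrees by looking them up in a dictionary (hash-consing). It is a simple sketch and not the linear-time algorithm of [7], since hashing nested tuples repeatedly can cost more than linear time. The name `minimal_dag` and the tuple encoding of D-terms are assumptions made for this example.

```python
def minimal_dag(trees):
    """Build the maximally factored DAG of a collection of D-terms.

    Returns (nodes, roots). nodes[i] is either a leaf label or a triple
    ("D", left_id, right_id); roots lists the node id of each input tree.
    """
    ids = {}     # subtree -> node id
    nodes = []   # node id -> leaf label or ("D", left_id, right_id)

    def visit(t):
        if t in ids:                      # subtree already present: share it
            return ids[t]
        if isinstance(t, tuple):
            entry = ("D", visit(t[1]), visit(t[2]))
        else:
            entry = t
        ids[t] = len(nodes)
        nodes.append(entry)
        return ids[t]

    roots = [visit(t) for t in trees]
    return nodes, roots


# The example D-term d = D(D(1,1), D(D(1,D(1,1)), D(1,D(1,1)))).
x = ("D", 1, ("D", 1, 1))
d = ("D", ("D", 1, 1), ("D", x, x))
nodes, roots = minimal_dag([d])
```

The number of entries in `nodes` is the number of distinct subtrees, including the leaf.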
Definition 2. (i) The tree size of a D-term $d$, in symbols $\text{t-size}(d)$,
is the number of occurrences of the function symbol D in $d$. (ii) The
compacted size of a D-term $d$ is defined as
$\text{c-size}(d) \stackrel{\text{def}}{=} |\{e \in \mathcal{D} \mid d \unrhd e\}|$.
(iii) The compacted size of a finite set $D$ of D-terms is defined as
$\text{c-size}(D) \stackrel{\text{def}}{=} |\{e \in \mathcal{D} \mid d \in D \text{ and } d \unrhd e\}|$.


The tree size of a D-term can equivalently be characterized as the number of 
its inner nodes. The compacted size of a D-term is the number of its distinct 
compound subterms. It can equivalently be characterized as the number of the 
inner nodes of its minimal DAG. As an example consider the D-term d defined
in formula (i), whose minimal DAG is shown in Fig. 2e. The tree size of d is
t-size(d) = 7 and the compacted size of d is c-size(d) = 4, corresponding to
the cardinality of the set $\{e \in \mathcal{D} \mid d \unrhd e\}$ of compound
subterms of d, i.e.,
{D(1,1), D(1,D(1,1)), D(D(1,D(1,1)), D(1,D(1,1))), d}.
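The two size measures can be checked on the example with a few lines of Python. This is a minimal sketch that assumes D-terms are encoded as nested tuples `("D", left, right)` with non-tuple leaves; the function names are ours.

```python
def t_size(d):
    """Tree size: number of occurrences of D, i.e., inner nodes of the tree."""
    if not isinstance(d, tuple):
        return 0
    return 1 + t_size(d[1]) + t_size(d[2])


def compound_subterms(d):
    """Set of distinct compound (non-primitive) subterms of d, including d."""
    if not isinstance(d, tuple):
        return set()
    return {d} | compound_subterms(d[1]) | compound_subterms(d[2])


def c_size(d):
    """Compacted size of a single D-term."""
    return len(compound_subterms(d))


def c_size_set(terms):
    """Compacted size of a finite set of D-terms."""
    return len(set().union(*(compound_subterms(d) for d in terms)))


x = ("D", 1, ("D", 1, 1))
d = ("D", ("D", 1, 1), ("D", x, x))   # the D-term of formula (i)
sizes = (t_size(d), c_size(d))
```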

As will be explicated in more detail below, each occurrence of the function 
symbol D in a D-term corresponds to an instance of the meta-level axiom Det 
in the represented proof. Hence the tree size measures the number of instances 
of Det in the proof. Another view is that each occurrence of D in a D-term 
corresponds to a condensed detachment step, without re-using already proven 
lemmas. The compacted size, in contrast, corresponds to the view that the size
of the proof of a lemma is counted only once, even if the lemma is used several
times. Tree size and compacted size of D-terms appear in [34] as CDcount and
length, respectively.


3.2 Proof Structures, Formula Substitutions and Semantics 


We use a notion of unifier that applies to a set of pairs of terms, as convenient 
in discussions based on the CM [1,9,8]. 


Definition 3. Let $M$ be a set of pairs of terms and let $\sigma$ be a
substitution. (i) $\sigma$ is said to be a unifier of $M$ if for all
$\{s,t\} \in M$ it holds that $s\sigma = t\sigma$. (ii) $\sigma$ is called a
most general unifier of $M$ if $\sigma$ is a unifier of $M$ and for all
unifiers $\sigma'$ of $M$ it holds that $\sigma' \geq \sigma$. (iii) $\sigma$
is called a clean most general unifier of $M$ if it is a most general unifier
of $M$ and, in addition, is idempotent and satisfies
$\mathrm{Dom}(\sigma) \cup \mathrm{VRng}(\sigma) \subseteq \mathrm{Var}(M)$.


The additional properties required for clean most general unifiers do not hold for 
all most general unifiers.³ However, the unification algorithms known from the
literature produce clean most general unifiers [9, Remark 4.2]. If a set of pairs 
of terms has a unifier, then it has a most general unifier and, moreover, also a 
clean most general unifier. 
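The two additional requirements on clean most general unifiers are easy to test for a given substitution. The following Python sketch checks only these two properties and the unifier property itself; it does not check most-generality. Variables are encoded as strings and compound terms as tuples `(functor, arg, ...)`, and all names are assumptions for this illustration.

```python
def apply(t, sub):
    """Apply the substitution sub (a dict from variables to terms) to t."""
    if isinstance(t, str):
        return sub.get(t, t)
    return (t[0],) + tuple(apply(a, sub) for a in t[1:])


def variables(t):
    """Set of variables occurring in t."""
    if isinstance(t, str):
        return {t}
    return set().union(*(variables(a) for a in t[1:]))


def is_unifier(sub, pairs):
    return all(apply(s, sub) == apply(t, sub) for s, t in pairs)


def is_idempotent(sub):
    """sub is idempotent if applying it twice equals applying it once."""
    return all(apply(apply(v, sub), sub) == apply(v, sub) for v in sub)


def within_vars(sub, pairs):
    """Check Dom(sub) ∪ VRng(sub) ⊆ Var(M) for the set of pairs M."""
    allowed = set().union(*(variables(s) | variables(t) for s, t in pairs))
    dom = {v for v in sub if sub[v] != v}
    rng = set().union(*(variables(sub[v]) for v in dom))
    return (dom | rng) <= allowed


pairs = [(("f", "x", "y"), ("f", "y", "z"))]
candidate = {"x": "z", "y": "z"}               # satisfies both requirements
with_extra = {"x": "w", "y": "w", "z": "w"}    # unifier using a foreign variable w
swap = {"x": "y", "y": "x"}                    # a renaming that is not idempotent
```

Here `with_extra` is a unifier of the pair but violates the variable condition, and `swap` illustrates a substitution that fails idempotence.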


Definition 4. (i) If M is a set of pairs of terms that has a unifier, then mgu(M)
denotes some clean most general unifier of M. M is called unifiable and mgu(M)
is called defined in this case, otherwise it is called undefined. (ii) We make the
convention that proposition, lemma and theorem statements implicitly assert
their claims only for the case where occurrences of mgu in them are defined.


³ The inaccuracy observed by [13] in early formalizations of condensed detachment
can be attributed to disregarding the requirement
$\mathrm{Dom}(\sigma) \cup \mathrm{VRng}(\sigma) \subseteq \mathrm{Var}(M)$.

Since we define mgu(M) as a clean most general unifier, we may assume that it
is idempotent and that all variables occurring in its domain and range occur in
M. Convention 4.ii serves to reduce clutter in proposition, lemma and theorem
statements.

The structural aspects of condensed detachment proofs represented by
D-terms, i.e., full binary trees, will now be supplemented with associated
formulas. Condensed detachment proofs, similar to CM proofs, involve different
instances of the input formulas (viewed as quantifier-free, e.g., clauses), which
may be considered as obtained in two steps: first, “copies”, that is, variants with
fresh variables, of the input formulas are created; second, a substitution is applied
to these copies. Let us now consider the first step. The framework of D-terms
permits giving the variables in the copies canonical designators with an index
subscript that identifies the position in the structure, i.e., in the D-term, or tree.


Definition 5. For all positions $p$ and positive integers $i$ let $x^i_p$ and
$y_p$ denote pairwise different variables.


Recall that positions are path specifiers. For a given D-term $d$ and leaf
position $p$ of $d$, the variables $x^i_p$ are for use in a formula associated
with $p$, which is the copy of an axiom. Different variables in the copy are
distinguished by the upper index $i$. If $p$ is a non-leaf position of $d$,
then $y_p$ denotes the variable in the conclusion of the copy of Det that is
represented by $p$. In addition, $y_p$ for leaf positions $p$ may occur in the
antecedents of the copies of Det. The following substitution
$\mathit{shift}_p$ is a tool to systematically rename position-associated
variables while preserving the internal relationships between the
index-referenced positions.


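Definition 6. For all positions $p$ define the substitution $\mathit{shift}_p$
as follows: $\mathit{shift}_p \stackrel{\text{def}}{=} \{y_q \mapsto y_{p.q} \mid q \text{ is a position}\} \cup \{x^i_q \mapsto x^i_{p.q} \mid i \geq 1 \text{ and } q \text{ is a position}\}$.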


The application of $\mathit{shift}_p$ to a term $s$ has the effect that $p$ is
prepended to the position indexes of all the position-associated variables
occurring in $s$. The association of axioms with primitive D-terms is
represented by mappings which we call axiom assignments, defined as follows.


Definition 7. An axiom assignment $\alpha$ is a mapping whose domain is a set
of primitive D-terms and whose range is a set of terms whose variables are in
$\{x^i_\epsilon \mid i \geq 1\}$. We say that $\alpha$ is for a D-term $d$ if
$\mathrm{Dom}(\alpha) \supseteq \mathrm{Prim}(d)$.


We define a shorthand for a form of Łukasiewicz that is suitable for use as a
range element of axiom assignments. It is parameterized with a position $p$.

$\text{Łukasiewicz}_p \stackrel{\text{def}}{=} i(i(i(x^1_p, x^2_p), x^3_p), i(i(x^3_p, x^1_p), i(x^4_p, x^1_p)))$. (ii)

The mapping $\{1 \mapsto \text{Łukasiewicz}_\epsilon\}$ is an axiom assignment
for all D-terms $d$ with $\mathrm{Prim}(d) = \{1\}$. The second step of
obtaining the instances involved in a proof can be performed by applying the
most general unifier of a set of pairs of terms that constrain it. The tree
structure of D-terms permits associating exactly one such pair with each term
position. Inner positions represent detachment steps and leaf positions
instances of an axiom according to a given axiom assignment. The following
definition specifies these constraining pairs.
Definition 8. Let $d$ be a D-term and let $\alpha$ be an axiom assignment for
$d$. For all positions $p \in \mathrm{Pos}(d)$ define the pair of terms
$\mathrm{pairing}_\alpha(d,p) \stackrel{\text{def}}{=} \{y_p, \alpha(d|_p)\mathit{shift}_p\}$
if $p \in \mathrm{LeafPos}(d)$, and
$\mathrm{pairing}_\alpha(d,p) \stackrel{\text{def}}{=} \{y_{p.1}, i(y_{p.2}, y_p)\}$
if $p \in \mathrm{InnerPos}(d)$.


A unifier of the set of pairings of all positions of a D-term $d$ equates for a
leaf position $p$ the variable $y_p$ with the value of the axiom assignment
$\alpha$ for the primitive D-term at $p$, after “shifting” variables by $p$.
This “shifting” means that the position subscript $\epsilon$ of the variables
in the axiom argument term $\alpha(d|_p)$ is replaced by $p$, yielding a
dedicated copy of the axiom argument term for the leaf position $p$. For inner
positions $p$ the unifier equates $y_{p.1}$ and $i(y_{p.2}, y_p)$, reflecting
that the major premise of Det is proven by the left child of $p$.

The substitution induced by the pairings associated with the positions of a
D-term allows associating a specific formula with each position of the D-term,
called the in-place theorem (IPT). The case where the position is the top
position $\epsilon$ is distinguished as most general theorem (MGT).


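Definition 9. For D-terms $d$, positions $p \in \mathrm{Pos}(d)$ and axiom
assignments $\alpha$ for $d$ define the in-place theorem (IPT) of $d$ at $p$
for $\alpha$, $\mathrm{Ipt}_\alpha(d,p)$, and the most general theorem (MGT) of
$d$ for $\alpha$, $\mathrm{Mgt}_\alpha(d)$, as
(i) $\mathrm{Ipt}_\alpha(d,p) \stackrel{\text{def}}{=} P(y_p\,\mathrm{mgu}(\{\mathrm{pairing}_\alpha(d,q) \mid q \in \mathrm{Pos}(d)\}))$.
(ii) $\mathrm{Mgt}_\alpha(d) \stackrel{\text{def}}{=} \mathrm{Ipt}_\alpha(d,\epsilon)$.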
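The definitions of position-indexed variables, shifting, pairings and the MGT can be traced in a small Python sketch. It is an illustration under assumptions, not the authors' implementation: variables are encoded as tuples `("y", p)` and `("x", i, p)` with positions as tuples of 1s and 2s, implications as `("i", s, t)`, and the unifier is kept in triangular form and resolved on demand instead of being returned as an idempotent substitution. All function names are hypothetical. As a check, line 3 of Fig. 3 is treated as an additional axiom and the MGT of D(3,1) is compared with line 4.

```python
def is_var(t):
    return t[0] in ("x", "y")

def walk(t, sub):
    while is_var(t) and t in sub:
        t = sub[t]
    return t

def occurs(v, t, sub):
    t = walk(t, sub)
    if is_var(t):
        return t == v
    return any(occurs(v, a, sub) for a in t[1:])

def unify(pairs):
    """Return a triangular unifier of all pairs, or None if none exists."""
    sub, todo = {}, list(pairs)
    while todo:
        s, t = todo.pop()
        s, t = walk(s, sub), walk(t, sub)
        if s == t:
            continue
        if is_var(s):
            if occurs(s, t, sub):
                return None
            sub[s] = t
        elif is_var(t):
            todo.append((t, s))
        elif s[0] == t[0] and len(s) == len(t):
            todo.extend(zip(s[1:], t[1:]))
        else:
            return None
    return sub

def resolve(t, sub):
    t = walk(t, sub)
    return t if is_var(t) else (t[0],) + tuple(resolve(a, sub) for a in t[1:])

def shift(t, p):
    """Prepend position p to the position index of every variable in t."""
    if t[0] == "y":
        return ("y", p + t[1])
    if t[0] == "x":
        return ("x", t[1], p + t[2])
    return (t[0],) + tuple(shift(a, p) for a in t[1:])

def pairings(d, alpha, p=()):
    """Constraining pairs of Def. 8 for all positions of d below p."""
    if not isinstance(d, tuple):                       # leaf position
        return [(("y", p), shift(alpha[d], p))]
    y1, y2 = ("y", p + (1,)), ("y", p + (2,))
    return ([(y1, ("i", y2, ("y", p)))]                # inner position
            + pairings(d[1], alpha, p + (1,)) + pairings(d[2], alpha, p + (2,)))

def mgt(d, alpha):
    """Argument term of the MGT of d, or None if it is undefined."""
    sub = unify(pairings(d, alpha))
    return None if sub is None else resolve(("y", ()), sub)

def formula(polish):
    """Read a Polish-notation implication into a term over x^i at position ()."""
    names = {}
    def parse(i):
        if polish[i] == "C":
            s, i = parse(i + 1)
            t, i = parse(i)
            return ("i", s, t), i
        return ("x", names.setdefault(polish[i], len(names) + 1), ()), i + 1
    return parse(0)[0]

def canonical(t, names=None):
    """Rename variables by first occurrence, to compare terms up to renaming."""
    names = {} if names is None else names
    if is_var(t):
        return names.setdefault(t, len(names))
    return (t[0],) + tuple(canonical(a, names) for a in t[1:])

alpha = {1: formula("CCCpqrCCrpCsp"), 3: formula("CCCpqrCqr")}
line4 = canonical(mgt(("D", 3, 1), alpha))
```

Comparing `line4` with `canonical(formula("CpCCpqCrq"))` checks the result against line 4 of Fig. 3 up to variable renaming.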


Since Ipt and Mgt are defined on the basis of mgu, they are undefined if the set 
of pairs of terms underlying the respective application of mgu is not unifiable. 
Hence, we apply the convention of Def. 4.ii for mgu also to occurrences of Ipt
and Mgt. If Ipt and Mgt are defined, they both denote an atom whose variables 
are constrained by the clean property of the underlying application of mgu. The 
following proposition relates IPT and MGT with respect to subsumption. 


Proposition 10. For all D-terms $d$, positions $p \in \mathrm{Pos}(d)$ and
axiom assignments $\alpha$ for $d$ it holds that
$\mathrm{Ipt}_\alpha(d,p) \geq \mathrm{Mgt}_\alpha(d|_p)$.


By Prop. 10, the IPT at some position $p$ of a D-term $d$ is subsumed by the
MGT of the subterm $d|_p$ of $d$ rooted at position $p$. An intuitive argument
is that the only constraints that determine the most general unifier underlying
the MGT are induced by positions of $d|_p$, that is, below $p$ (including $p$
itself). In contrast, the most general unifier underlying the IPT is determined
by all positions of $d$.

The following lemma expresses the core relationships between a proof struc- 
ture (a D-term), a proof substitution (accessed via the IPT) and semantic en- 
tailment of associated formulas. 


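Lemma 11. Let $d$ be a D-term and let $\alpha$ be an axiom assignment for $d$.
Then for all $p \in \mathrm{Pos}(d)$ it holds that: (i) If
$p \in \mathrm{LeafPos}(d)$, then
$\forall P(\alpha(d|_p)) \models \mathrm{Ipt}_\alpha(d,p)$. (ii) If
$p \in \mathrm{InnerPos}(d)$, then
$\mathrm{Det} \land \mathrm{Ipt}_\alpha(d,p.1) \land \mathrm{Ipt}_\alpha(d,p.2) \models \mathrm{Ipt}_\alpha(d,p)$.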


Based on this lemma, the following theorem shows how Detachment together 
with the axioms in an axiom assignment entail the MGT of a given D-term. 


Theorem 12. Let $d$ be a D-term and let $\alpha$ be an axiom assignment for
$d$. Then
$\mathrm{Det} \land \bigwedge_{p \in \mathrm{LeafPos}(d)} \forall P(\alpha(d|_p)) \models \forall\,\mathrm{Mgt}_\alpha(d)$.
Theorem 12 states that Det together with the axioms referenced in the proof,
that is, the values of a for the leaf nodes of d considered as universally closed 
atoms, entail the universal closure of the MGT of d for a. The universal closure 
of the MGT is the formula exhibited in Meredith’s proof notation in the lines 
with a trailing D-term, such as lines 2-19 in Fig. 3. 


4 Reducing the Proof Size by Replacing Subproofs 


The term view on proof trees suggests shortening proofs by rewriting subterms,
that is, replacing occurrences of subproofs by other ones, with three main aims:
(1) To shorten given proofs, with respect to the tree size or the compacted
size. (2) To investigate whether given proofs can be shortened by certain
rewritings or are closed under these. (3) To develop notions of redundancy for
use in proof search: a proof fragment constructed during search may be rejected
if it can be rewritten to a shorter one.

It is obvious that if a D-term d’ is obtained from a D-term d by replacing an 
occurrence of a subterm e with a D-term e’ such that t-size(e) > t-size(e’), then 
also t-size(d) > t-size(d’). Based on the following ordering relations on D-terms,
which we call compaction orderings, an analogous statement for reducing the
compacted size instead of the tree size can be made.


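Definition 13. For D-terms $d, e$ define
(i) $d \geq_c e \stackrel{\text{def}}{=} \{f \in \mathcal{D} \mid d \rhd f\} \supseteq \{f \in \mathcal{D} \mid e \rhd f\}$.
(ii) $d >_c e \stackrel{\text{def}}{=} d \geq_c e$ and $e \not\geq_c d$.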


The relations $d \geq_c e$ and $d >_c e$ compare D-terms $d$ and $e$ with
respect to the superset relationship of their sets of those strict subterms
that are compound terms. For example,
$D(D(D(1,1),1),1) \geq_c D(1,D(1,1))$ because
$\{D(1,1), D(D(1,1),1)\} \supseteq \{D(1,1)\}$.
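The compaction orderings reduce to set comparisons, as the following Python sketch illustrates. It assumes the nested-tuple encoding `("D", left, right)` of D-terms with non-tuple leaves; the names `geq_c` and `gt_c` are ours.

```python
def compound_subterms(d):
    """Set of compound subterms of d, including d itself if it is compound."""
    if not isinstance(d, tuple):
        return set()
    return {d} | compound_subterms(d[1]) | compound_subterms(d[2])


def strict_compound_subterms(d):
    """Strict subterms of d that are compound terms."""
    return compound_subterms(d) - {d}


def geq_c(d, e):
    """Non-strict compaction ordering: superset of strict compound subterms."""
    return strict_compound_subterms(d) >= strict_compound_subterms(e)


def gt_c(d, e):
    """Strict compaction ordering."""
    return geq_c(d, e) and not geq_c(e, d)


d = ("D", ("D", ("D", 1, 1), 1), 1)   # D(D(D(1,1),1),1)
e = ("D", 1, ("D", 1, 1))             # D(1,D(1,1))
holds = geq_c(d, e)
```

For the example pair the relation holds in the non-strict and also in the strict sense, since the converse inclusion fails.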


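Theorem 14. Let $d, d', e, e'$ be D-terms such that $e$ occurs in $d$, and
$d' = d[e \mapsto e']$. It holds that (i) If $e \in \mathcal{D}$ and
$e \geq_c e'$, then $\text{c-size}(d) \geq \text{c-size}(d')$. (ii) If
$e >_c e'$, then $\text{sc-size}(d) > \text{sc-size}(d')$, where, for all
D-terms $d$,
$\text{sc-size}(d) \stackrel{\text{def}}{=} \sum_{d \unrhd e} \text{c-size}(e)$.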


Theorem 14.i states that if $d'$ is the D-term obtained from $d$ by
simultaneously replacing all occurrences of a compound D-term $e$ with a
“c-smaller” D-term $e'$, i.e., $e \geq_c e'$, then the compacted size of $d'$
is less than or equal to that of $d$. As stated with the supplementary
Theorem 14.ii, the sc-size is a measure that strictly decreases under the
strict precondition $e >_c e'$, which is useful to ensure termination of
rewriting. The following proposition characterizes the number of D-terms that
are smaller than a given D-term w.r.t. the compaction ordering $\geq_c$.


Proposition 15. For all D-terms $d$ it holds that
$|\{e \mid d \geq_c e \text{ and } \mathrm{Prim}(e) \subseteq \mathrm{Prim}(d)\}| = (\text{c-size}(d) - 1 + |\mathrm{Prim}(d)|)^2 + |\mathrm{Prim}(d)|$.


By Prop. 15, for a given D-term d, the number of D-terms e that are smaller 
than d with respect to >. is only quadratically larger than the compacted size 
of d and thus also than the tree size of d. Hence techniques that inspect all these 
smaller D-terms for a given D-term can efficiently be used in practice. 
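The count in Prop. 15 can be made concrete by enumerating the smaller D-terms directly. The Python sketch below does this for a compound D-term d, under the observation that such a term is either a primitive D-term of d or has the form D(a, b), where a and b are each a primitive D-term of d or a strict compound subterm of d. The tuple encoding and the function names are assumptions made for this example.

```python
from itertools import product


def compound_subterms(d):
    if not isinstance(d, tuple):
        return set()
    return {d} | compound_subterms(d[1]) | compound_subterms(d[2])


def strict_compound_subterms(d):
    return compound_subterms(d) - {d}


def prim(d):
    if isinstance(d, tuple):
        return prim(d[1]) | prim(d[2])
    return {d}


def geq_c(d, e):
    return strict_compound_subterms(d) >= strict_compound_subterms(e)


def c_smaller(d):
    """All D-terms e with d >=_c e and Prim(e) a subset of Prim(d); d compound."""
    args = strict_compound_subterms(d) | prim(d)
    return prim(d) | {("D", a, b) for a, b in product(args, repeat=2)}


def predicted_count(d):
    """Right-hand side of the equation in Prop. 15."""
    return (len(compound_subterms(d)) - 1 + len(prim(d))) ** 2 + len(prim(d))


d = ("D", ("D", 1, 1), 1)
smaller = c_smaller(d)
```

For inspecting all candidates in practice, one would iterate over `c_smaller(d)` and filter by further criteria such as a defined MGT.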
According to Theorem 12, a condensed detachment proof, i.e., a D-term d
and an axiom assignment α, proves the MGT of d for α along with instances of
the MGT. In general, replacing subterms of d should yield a proof of at least
these theorems, that is, a proof whose MGT subsumes the original one. The
following theorem expresses conditions which ensure that subterm replacements
yield a proof with an MGT that subsumes the original one.


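Theorem 16. Let $d, e$ be D-terms, let $\alpha$ be an axiom assignment for $d$
and for $e$, and let $p_1, \ldots, p_n$, where $n > 0$, be positions in
$\mathrm{Pos}(d)$ such that for all $i, j \in \{1, \ldots, n\}$ with
$i \neq j$ it holds that $p_i \not\leq p_j$. If for all
$i \in \{1, \ldots, n\}$ it holds that
$\mathrm{Ipt}_\alpha(d,p_i) \geq \mathrm{Mgt}_\alpha(e)$, then
$\mathrm{Mgt}_\alpha(d) \geq \mathrm{Mgt}_\alpha(d[e]_{p_1}[e]_{p_2} \ldots [e]_{p_n})$.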


Theorem 16 states that simultaneously replacing a number of occurrences of 
possibly different subterms in a D-term by the same subterm with the property 
that its MGT subsumes each of the IPTs of the original occurrences results in an 
overall D-term whose MGT subsumes that of the original overall D-term. The 
following theorem is similar, but restricted to a single replaced occurrence and 
with a stronger precondition. It follows from Theorem 16 and Prop. 10. 


Theorem 17. Let $d, e$ be D-terms and let $\alpha$ be an axiom assignment for
$d$ and for $e$. For all positions $p \in \mathrm{Pos}(d)$ it then holds that
if $\mathrm{Mgt}_\alpha(d|_p) \geq \mathrm{Mgt}_\alpha(e)$, then
$\mathrm{Mgt}_\alpha(d) \geq \mathrm{Mgt}_\alpha(d[e]_p)$.


Simultaneous replacements of subterm occurrences are essential for reducing the
compacted size of proofs according to Theorem 14. For replacements according
to Theorem 17 they can be achieved by successive replacements of individual
occurrences. In Theorem 16 simultaneous replacements are explicitly considered
because the replacement of one occurrence according to this theorem can
invalidate the preconditions for another occurrence. Theorem 17 can be useful in
practice because the precondition
$\mathrm{Mgt}_\alpha(d|_p) \geq \mathrm{Mgt}_\alpha(e)$ can be evaluated on the
basis of $\alpha$, $e$ and just the subterm $d|_p$ of $d$, whereas determining
$\mathrm{Ipt}_\alpha(d,p)$ for Theorem 16 requires also consideration of the
context of $p$ in $d$. Based on Theorems 16 and 14 we define the following
notions of reduction and regularity.


Definition 18. Let $d$ be a D-term, let $e$ be a subterm of $d$ and let
$\alpha$ be an axiom assignment for $d$. For D-terms $e'$ the D-term
$d[e \mapsto e']$ is then obtained by C-reduction from $d$ for $\alpha$ if
$e >_c e'$, $\mathrm{Mgt}_\alpha(e')$ is defined, and for all positions
$p \in \mathrm{Pos}(d)$ such that $d|_p = e$ it holds that
$\mathrm{Ipt}_\alpha(d,p) \geq \mathrm{Mgt}_\alpha(e')$. The D-term $d$ is
called C-reducible for $\alpha$ if and only if there exists a D-term $e'$ such
that $d[e \mapsto e']$ is obtained by C-reduction from $d$ for $\alpha$.
Otherwise, $d$ is called C-regular.


If $d'$ is obtained from $d$ by C-reduction, then by Theorems 16 and 14 it
follows that $\mathrm{Mgt}_\alpha(d) \geq \mathrm{Mgt}_\alpha(d')$,
$\text{c-size}(d) \geq \text{c-size}(d')$ and
$\text{sc-size}(d) > \text{sc-size}(d')$. C-regularity differs from well-known
concepts of regularity in clausal tableaux (see, e.g., [14]) in two respects:
(1) In the comparison of two nodes on a branch (which is done by subsumption as
in tableaux with universal variables) for the upper node the stronger
instantiated IPT is taken and for the lower node the more weakly instantiated
MGT. (2) C-regularity is not based on relating two nested subproofs, but on
comparison of all occurrences of a subproof with respect to all proofs that are
smaller with respect to the compaction ordering.

Proofs may involve applications of Det where the conclusion Py is actually
independent from the minor premise Px. Any axiom can then serve as a trivial
minor premise. Meredith expresses this with the symbol n as second argument
of the respective D-term. Our function simp-n simplifies D-terms by replacing
subterms with n accordingly on the basis of the preservation of the MGT.


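Definition 19. If $d$ is a D-term and $\alpha$ is an axiom assignment for $d$,
then the n-simplification of $d$ with respect to $\alpha$ is the D-term
$\text{simp-n}_\alpha(d)$, where simp-n is the following function:
$\text{simp-n}_\alpha(d) \stackrel{\text{def}}{=} d$, if $d$ is a primitive
D-term;
$\text{simp-n}_\alpha(D(d_1,d_2)) \stackrel{\text{def}}{=} D(\text{simp-n}_\alpha(d_1), n)$
if $\mathrm{Mgt}_{\alpha'}(D(d_1,n)) = \mathrm{Mgt}_\alpha(D(d_1,d_2))$, where
$\alpha' = \alpha \cup \{n \mapsto k\}$ for a fresh constant $k$;
$\text{simp-n}_\alpha(D(d_1,d_2)) \stackrel{\text{def}}{=} D(\text{simp-n}_\alpha(d_1), \text{simp-n}_\alpha(d_2))$,
else.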


5 Properties of Meredith’s Refined Proof 


Our framework renders condensed detachment as a restricted form of the CM.
This view permits considering the expanded proof structures as binary trees or
D-terms. On this basis we obtain a natural characterization of proof properties
in various categories, which seem to be the key towards reducing the search space
in ATP. Table 1 shows such properties for each of the 34 structurally different
subproofs of proof MER (Fig. 3). Column M gives the number of the subproof
in Fig. 3. We use the following short identifiers for the observed properties:


Structural Properties of the D-Term. These properties refer to the respective
subproof as D-term or full binary tree. DT, DC, DH: Tree size, compacted
size, height. DK_L, DK_R: “Successive height”, that is, the maximal number of
successive edges going to the left (right, resp.) on any path from the root to a
leaf. DP: Is “prime”, that is, DT and DC are equal. DS: Relationship between
the subproofs of major and minor premise. Identity is expressed with =, the
subterm and superterm relationships with < and >, resp., and the compaction
ordering relationship (if none of the other relationships holds) with $<_c$
and $>_c$. In addition it is indicated if a subproof is an axiom or n. DD:
“Direct sharings”, that is, the number of incoming edges in the DAG
representation of the overall proof of all theorems. DR: “Repeats”, that is,
the total number of occurrences in the set of expanded trees of all roots of
the DAG.
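Several of these structural properties are simple recursive functions of the D-term. The following Python sketch computes height, successive height and primality under our reading of “successive height” as the longest chain of consecutive left (or right) edges starting at any node. The tuple encoding `("D", left, right)` and the function names are assumptions for this illustration.

```python
def t_size(d):
    return 0 if not isinstance(d, tuple) else 1 + t_size(d[1]) + t_size(d[2])


def compound_subterms(d):
    if not isinstance(d, tuple):
        return set()
    return {d} | compound_subterms(d[1]) | compound_subterms(d[2])


def height(d):
    """DH: length of the longest path from the root to a leaf."""
    if not isinstance(d, tuple):
        return 0
    return 1 + max(height(d[1]), height(d[2]))


def chain(d, side):
    """Number of consecutive edges to the given side starting at the root of d."""
    if not isinstance(d, tuple):
        return 0
    return 1 + chain(d[side], side)


def successive_height(d, side):
    """DK_L for side=1, DK_R for side=2 (under the reading stated above)."""
    if not isinstance(d, tuple):
        return 0
    return max(chain(d, side),
               successive_height(d[1], side),
               successive_height(d[2], side))


def is_prime(d):
    """DP: tree size and compacted size coincide."""
    return t_size(d) == len(compound_subterms(d))


d = ("D", ("D", 1, ("D", 1, 1)), 1)   # expansion of D31 with 3 = D(1, D(1,1))
props = (t_size(d), height(d), successive_height(d, 1), successive_height(d, 2))
```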


Properties of the MGT. These properties refer to the argument term of the 
MGT of the respective subproof. TT, TH: Tree size (defined as for D-terms) and 
height. TV: Number of different variables occurring in the term. TO: Is “organic” 
[21], that is, the argument term has no strict subterm s such that P(s) itself is 
a theorem. We call an atom weakly organic (indicated by a gray bullet) if it is 
not organic and the argument term is of the form i(p,t) where p is a variable 
that does not occur in the term t and P(t) is organic. For axiomatizations of
fragments of propositional logic, organic can be checked by a SAT solver. 
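For small formulas the organic property can also be checked by brute force. The Python sketch below uses truth tables in place of a SAT solver, which is a substitution of technique for illustration only. It assumes that theoremhood coincides with classical validity, as is the case for a complete axiomatization of the implicational fragment; formulas are read from Polish notation and all names are hypothetical.

```python
from itertools import product


def parse_formula(polish):
    """Read a Polish-notation implicational formula into ("C", a, b) tuples."""
    def parse(i):
        if polish[i] == "C":
            left, i = parse(i + 1)
            right, i = parse(i)
            return ("C", left, right), i
        return polish[i], i + 1
    term, end = parse(0)
    if end != len(polish):
        raise ValueError("unexpected trailing input")
    return term


def variables(t):
    if isinstance(t, str):
        return {t}
    return variables(t[1]) | variables(t[2])


def evaluate(t, valuation):
    if isinstance(t, str):
        return valuation[t]
    return (not evaluate(t[1], valuation)) or evaluate(t[2], valuation)


def is_tautology(t):
    vs = sorted(variables(t))
    return all(evaluate(t, dict(zip(vs, bits)))
               for bits in product([False, True], repeat=len(vs)))


def strict_subterms(t):
    if isinstance(t, str):
        return set()
    return {t[1], t[2]} | strict_subterms(t[1]) | strict_subterms(t[2])


def is_organic(t):
    """No strict subterm is itself valid (assumed to mean: is a theorem)."""
    return not any(is_tautology(s) for s in strict_subterms(t))


simp = parse_formula("CpCqp")
organic = is_organic(simp)
```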


Regularity. RC: The respective subproof as D-term is C-regular (see Def. 18). 
M DT DC DH DK_L DK_R DP DS DD DR TT TC TH TV TO RC MT MC IT_U IT_M IH_U IH_M
1.1 1 0 0 0 0 060 — 17554 6 6 3 4¢e e 0 04451 203 18 11 
2. D11 111 l l e 1=1 145 8 7 4 5o e 1 11640 220 17 12 
3. D12 2 2 2 1 26 4511 8 4 6¢ o 2 21881 252 17 12 
4. D31 3 3 3 2 2e pl 45 5 5 4 4 ° 3 3 689 92 16 11 
5. D4n 244 4 3 2% òn 45 4 4 3 30 o 4 4 688 91 15 10 
6. D15 5 5 5 3 2e 45 6 5 3 40 oè 5 51667 198 15 10 
7.D16 6 6 6 3 3e 45 76 4 50 oè 6 61802 208 16 11 
8. D17 7 7 7 3 4% 45 9 7 46o oè 7 7 2648 303 16 11 
9. D81 8 8 8 3 4e 1 45 5 5 4 4 ° 8 81032 119 15 10 
0. D9n 3 999 3 4e pn 545 44 330 o 9 91031 118 14 9 
1.D10.1 4 10 10 10 4 4e 237 4 4 3 30 e 10 10 448 60 13 9 
2. D1.11 11 11 11 4 4e 23 7 7 55e e 11 11 498 73 14 10 
3. D1.12 12 12 12 4 4e 23 12 8 5 6 oe e 12 121157 168 14 10 
4. D1.13 13 13 133 4 4e 23 10 9 6 Te e 13 [12,13] 1050 159 15 11 
5. D1.14 14 14 14 4 5e 23 15 10 6 8e e 14 [12,14] 1657 246 15 11 
6. D15.1 15 15 15 4 5e pl 23 9 8 5 6 e 15 [12,15] 684 100 14 10 
7.D16.n 5 16 16 16 4 5e >n 2 23 8 7 4 50 o 16 [12,16] 683 99 13 9 
8.D17.1 6 171717 4 5e pl 318 7 6 3 4o o 17 [12,17] 395 56 12 8 
9. D18.11 7 28 18 18 5 5- B 11476 4 4o e 14 [12,14] 209 61 11 9 
20. D19.1 8 29 19 19 6 5- Bl 214 9 8 5 5o oè 15 [12,15] 132 38 10 8 
21. D1.20 10 30 20 20 6 5- 14 2 1012 9 5 Geo e 16 [12,16] 158 47 10 8 
22. D21.21 61 21 21 6 5- = 1 510 9 5 6 e [23,33] [12,17] 53 16 9 7 
23. D22.n 11 62 22 2 6 5- >n 1 5 9 8 4 50 ə [23,34] [12,18] 52 15 8 6 
24. D17.2312 79 23 23 6 5- < 2 5 9 8 4 5e ə [23,51] [12,23] 57 16 7 5 
25. D24.18 13 97 24 24 6 5- > 2 2 76 4 40 e [23,69] [12,24] 27 17 6 5 
26. D20.10 9 39 20 20 7 5- B 2 43 22 2% — 8 6 27 7 6 4 
27. D24.26 14119 25 24 7 5- > 2 3 5 5 3 36e e ([23,91][12,25] 24 7 6 4 
28. D10.27 15129 26 25 7 5- < 1 2 3 3 3 2e e [23,101] [12,26] 19 12 6 5 
29.D18.2816147 27 26 7 5- < 2 2 5 5 4 3 e¢ e [23,36] [12,26] 19 12 6 5 
30. D29.29 295 28 27 7 6- = 1 110 7 5 4e e [23,239] [12,27] 13 13 5 5 
31. D25.30 393 30 28 7 7- <. 1 1 7 7 5 4e e [23,121] [12,29] 13 13 5 5 
32.D31.2517491 31 29 7 7- > 0 1 5 5 3 3% e [23,191] [12,30 5 5 3 3 
33.D27.2618159 26 25 7 5- > 0 13 3 3 2e e 15 11 3 3 3 3 
34. D10.1019 19 10 10 4 4- = 0 122 2 26¢ è 7 6 2 2 2 2 


Table 1. Properties of all subproofs of the proof MER [24] shown in Fig. 3. 


Comparisons with all Proofs of the MGT. These properties relate to the
set of all proofs (as D-terms) of the MGT of the respective subproof. MT,
MC: Minimal tree size and minimal compacted size of a proof. These values
can be hard to determine, so that in Table 1 they are often only narrowed
down to an integer interval. To determine them, we used the proof MER, proofs
obtained with techniques described in Sect. 6, and enumerations of all D-terms
with defined MGT up to a given tree size or compacted size.


Properties of Occurrences of the IPTs. The respective subproof has DR
occurrences in the set of expanded trees of the roots of the DAG, where each
occurrence has an IPT. The following properties refer to the multiset of
argument terms of the IPTs of these occurrences. IT_U, IT_M: Maximal tree size
and rounded median of the tree size. IH_U, IH_M: Maximal height and rounded
median of the height. In Table 1 these values are much larger than those of the
corresponding columns for the MGT, i.e., TT and TH, illustrating Prop. 10.


6 First Experiments 


First experiments based on the framework developed in the previous sections
are centered around the generation of lemmas where not just formulas but, in
the form of D-terms, also proofs are taken into account. This leads in general
to a preference for small proofs and to narrowing down the search space by
restricted structuring principles to build proofs. The experiments indicate
novel potential calculi which combine aspects from lemma-based generative,
bottom-up methods such as hyperresolution and hypertableaux with
structure-based approaches that are typically used in an analytic,
goal-directed way such as the CM. In addition, ways to generate lemmas as
preprocessing for theorem proving are suggested, in particular to obtain short
proofs. This resulted in a refinement of Łukasiewicz’s proof [19], whose
compacted size is smaller by one than that of Meredith’s refinement [24] and
by two than that of Łukasiewicz’s original proof.


     Lemmas                 #    Time | Prover              Time | DC   DT       DH

  1.                                  | Łukasiewicz*             | 32   435      29
  2.                                  | Meredith                 | 31   491      29
  3.                                  | Prover9             37s  | 94   304,890  40
  4.                                  | Prover9*            37s  | 83   8,217    38
  5.                                  | Prover9* depth ≤ 7  6s   | 102  19,113   48
  6. PrimeCore(17)          17        | Prover9*            30s  | 44   763      28
  7. ProofSubproof(93,7)    291  78s  | Prover9*            3s   | 51   1,405    31
  8. ProofSubproof(93,7)    291  78s  | CMProver            2s   | 30   394      29
  9. ProofSubproof(100,8)   330  94s  | CMProver            4s   | 30   535      29
 10.                                  | Reduction of (8.)        | 48   191      24


Table 2. Proof dimensions of various proofs of problem ŁDS.

Table 2 shows compacted size DC, tree size DT and height DH of various
proofs of ŁDS. Asterisks indicate that n-simplification was applied with reducing
effect on the system’s proof. Proof (1.) is the one by Łukasiewicz [19], translated
into condensed detachment; proof (2.) is proof MER (Fig. 3) [24]. Rows (3.)-(5.)
show results from Prover9, where in (5.) the value of max_depth was limited
to 7, motivated by column TH of Table 1. Proof (4.) illustrates the effect of
n-simplification. For proofs (6.)-(9.) additional axioms were supplied to Prover9
and CMProver [5,35,36], a goal-directed system that can be described by the
CM. Columns indicate the lemma computation method, the number of lemmas
supplied to the prover and the time used for lemma computation. Method
PrimeCore adds the MGTs of subproof 18 from Table 1 and all its subproofs
as lemmas. Subproof 18 is the largest subproof of proof MER that is prime. It
can be characterized on the basis of the axiom, almost uniquely, as a proof
that is prime, whose MGT has no smaller prime proof and has the same number
of different variables as the axiom, i.e., 4, and whose size, given as parameter,
is 17. Method ProofSubproof is based on detachment steps with a D-term and a
subterm of it as proofs of the premises, which, as column DS of Table 1 shows,
suffices to justify all except two proof steps in MER. It proceeds in some
analogy to the given clause algorithm on lists of D-terms: If d is the given D-term,
then the inferred D-terms are all D-terms that have a defined MGT and are of
the form D(d, e) or D(e, d), where e is a subterm of d. To determine which of the
inferred D-terms are kept, values from Table 1 were taken as guide, including
RC and TO. The first parameter of ProofSubproof is the number of iterations
of the “given D-term loop”. Proof (9.) can be combined with Peirce and Syll to
the overall proof with compacted size 32, one less than MER. The maximal value
of DK_L is shown as second parameter, because, when limited to 7, proof (9.)
cannot be found. Proof (10.), which has a small tree size, was obtained from (8.)
by rewriting subproofs with a variation of C-reduction that rewrites single term
occurrences, considering also D-terms from a precomputed table of small proofs.


4 All machine results refer to a system with Intel i7-8550U CPU and 16 GB RAM.
Results for further systems: KRHyper* [26]: 1.610 s, DC: 73; EF 2.5 [30]: 30 s, proof
length 91; Vampire 5.4.1 [33] -mode casc -t 300: 128 s, proof length 144.
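The structural part of the ProofSubproof inference step can be sketched in a few lines of Python. The sketch only generates the candidates D(d, e) and D(e, d) for subterms e of a given D-term d. The test for a defined MGT and the selection heuristics based on Table 1 are left to a caller-supplied predicate, here the hypothetical parameter `has_defined_mgt`, which defaults to accepting everything. The tuple encoding of D-terms is an assumption as well.

```python
def subterms(d):
    """All subterms of d, including d itself and its leaves."""
    if isinstance(d, tuple):
        return {d} | subterms(d[1]) | subterms(d[2])
    return {d}


def inferred(d, has_defined_mgt=lambda t: True):
    """Candidate D-terms D(d, e) and D(e, d) for all subterms e of d.

    has_defined_mgt is a placeholder for the check that the MGT of a
    candidate is defined with respect to the axiom assignment in use.
    """
    out = set()
    for e in subterms(d):
        for candidate in (("D", d, e), ("D", e, d)):
            if has_defined_mgt(candidate):
                out.add(candidate)
    return out


given = ("D", 1, 1)
candidates = inferred(given)
```

A loop in the style of the given clause algorithm would repeatedly pick a given D-term from a list, call `inferred` on it, and append the kept candidates to the list.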


7 Conclusion 


Starting out from investigating Łukasiewicz’s classic formal proof [19], via its
refinement by Meredith [24], we arrived at a formal reconstruction of Meredith’s
condensed detachment as a special case of the CM. The resulting formalism yields 
proofs as objects of a very simple and common structure: full binary trees which, 
in the tradition of term rewriting, appear as terms, D-terms, as we call them. To 
form a full proof, formulas are associated with the nodes of D-terms: axioms with 
the leaves and lemmas with the remaining nodes, implicitly determined from the 
axioms through the node position and unification. The root lemma is the most 
general proven theorem. Lemmas also relate to compressed representations of 
the binary trees, for example as DAGs, where the re-use of a lemma directly 
corresponds to sharing the structure of its subproof. For future work we intend 
to position our approach also in the context of earlier works on proofs, proof 
compression and lemma introduction, e.g., [38,12], and think of compressing 
D-terms in forms that are stronger than DAGs, e.g., by tree grammars [18].

The combination of formulas and explicitly available proof structures natu- 
rally leads to theorem proving methods that take structural aspects into account, 
in various ways, as demonstrated by our first experiments. This goes beyond the 
common clausal tableau realizations of the CM, which in essence operate by enu- 
merating uncompressed proof structures. The discussed notions of regularity and 
lemma generation methods seem immediately suited for further investigations 
in the context of first-order theorem proving in general. For other aspects of 
the work we plan a stepwise generalization by considering further single axioms 
for the implicational fragment IF [21,19,32], single axioms and axiom pairs for 
further logics [32], the about 200 condensed detachment problems in the LCL 
domain of the TPTP, problems which involve multiple non-unit clauses, and 
adapting D-terms to a variation of binary resolution instead of detachment. In 
the longer run, our approach aims at providing a basis for approaches to theo- 
rem proving with machine learning (e.g. [10,15]). With the reification of proof 
structures more information is available as starting point. As indicated with our 
exemplary feature table for Meredith’s proof, structural properties are consid- 
ered thereby from a global point of view, as a source for narrowing down the 
search space in many different ways in contrast to just the common local view 
“from within a structure”, where the narrowing down is achieved for example by
focusing on a “current branch” during the construction of a tableau. A general
guiding question opened up by our setting is how properties of proof structures
relate to the associated formulas in proofs of meaningful theorems. One may
expect that characterizations of these relationships can substantially restrict
the search space for finding proofs.


Acknowledgments. We appreciate the competent comments of all the referees. 
Abstract. We present novel reductions of the propositional modal logics
KB, KD, KT, K4 and K5 to Separated Normal Form with Sets of Modal
Levels. The reductions result in smaller formulae than the well-known
reductions by Kracht and allow us to use the local reasoning of the prover
KSP to determine the satisfiability of modal formulae in these logics. We
show experimentally that the combination of our reductions with the
prover KSP performs well when compared with a specialised resolution
calculus for these logics and with the built-in reductions of the first-order
prover SPASS.


1 Introduction 


The main motivation for reducing problems in one logic (the source logic) to 
‘equivalent’ problems in another logic (the target logic) is to exploit results and 
tools for the target logic to solve theoretical or practical problems in the source 
logic. For propositional modal logics this approach has been researched exten- 
sively for reductions of the satisfiability problem in these logics to the satisfiabil- 
ity problem in ‘stronger’ logics such as first-order logic [10,20], the second-order 
theory of n successors [6], simple type theory [4], and regular grammar logics [19]. 

An alternative approach is to reduce propositional modal logics to a ‘weaker’
logic, in particular, the basic modal logic K. For extensions of K with one of the
axioms B, D, alt$_1$, T, and 4, Kracht [12] defines reduction functions of their global
and local satisfiability problem to the corresponding problem in K and proves
their correctness. He also defines a reduction function for K5, the extension
of K with 5, to K4, but this reduction is incorrect as not all theorems of K4
are theorems of K5. Several features of Kracht’s approach are relevant to our
work. First, as is not uncommon in modal logic, he treats the modal operator
$\Diamond$ as an abbreviation for $\neg\Box\neg$, that is, $\Box$ is the only modal operator
occurring in modal formulae. Second, the basic idea underlying his reduction
functions is for a given modal formula $\varphi$ to generate sufficiently many
instances $\Delta$ of a modal axiom $A$ so that $\varphi$ is K$A$-satisfiable iff
$\varphi \wedge \Delta$ is K-satisfiable. Third, Kracht is only concerned with
preservation of the computational complexity of the satisfiability problem under
consideration, as well as the preservation of other theoretical properties. For
instance, the local satisfiability problem in the modal logics covered by Kracht
is PSPACE-complete. So, it is sufficient to ensure that $\Delta$ is polynomial in
size with respect to $\varphi$. As Kracht himself concludes, his method offers a
uniform way of transferring results about one modal logic to another, but may not
be as useful for practical applications.

In [16,15] we have introduced a new normal form for basic multi-modal logic,
called Separated Normal Form with Modal Levels, $SNF_{ml}$, that uses labelled
modal clauses. These labels refer to the level within a tree Kripke structure
at which a modal clause holds. This can be seen as a compromise between
approaches that label formulae with worlds at unspecified level [1,3] and approaches
that label formulae with paths [5,23]. A combination of a normal form
transformation for modal formulae and a resolution-based calculus for labelled modal
clauses can then be used to decide local and global satisfiability in basic modal
logic. In [17,18] we have presented KSP, an implementation of that calculus,
together with an experimental evaluation that indicates that KSP performs well
if propositional variables are evenly spread across a wide range of modal levels
within the formulae one wants to decide.

A feature of $SNF_{ml}$ is its use of additional propositional symbols as
‘surrogates’ for subformulae of a modal formula $\varphi$. In the following we take
advantage of the availability of those surrogates to provide a novel transformation
from extensions of K with a single one of the axioms B, D, T, 4 and 5 to
$SNF_{ml}$. Another novel aspect is that we modify the normal form so that it uses
sets of modal levels as labels instead of a single modal level. In K we only need
a definition of a surrogate at the modal level at which the corresponding
subformula occurs in $\varphi$. But in KB, KT, K4 and K5, we need a definition at every
reachable modal level, of which there can be many. We call the resulting normal
form Separated Normal Form with Sets of Modal Levels, $SNF_{sml}$.

The structure of the paper is as follows. In Section 2 we recap common
concepts of propositional modal logic including its syntax and semantics. Section
3 defines $SNF_{sml}$ and the reductions of K, KB, KD, KT, K4 and K5 to $SNF_{sml}$.
Correctness is proved in Section 4. Related work is discussed in Section 5. In
Section 6 we compare the performance of a combination of our reductions and
the modal-layered resolution calculus implemented in the prover KSP with
resolution calculi specifically designed for the logics under consideration and with
translation-based approaches built into the first-order theorem prover SPASS.


2 Preliminaries 


The language of modal logic is an extension of the language of propositional
logic with a unary modal operator $\Box$ and its dual $\Diamond$. More precisely, given a
denumerable set of propositional symbols, $P = \{p, p_0, q, q_0, t, t_0, \ldots\}$ as well as
propositional constants true and false, modal formulae are inductively defined
as follows: Constants and propositional symbols are modal formulae. If $\varphi$ and $\psi$
are modal formulae, then so are $\neg\varphi$, $(\varphi \wedge \psi)$, $(\varphi \vee \psi)$,
$(\varphi \to \psi)$, $\Box\varphi$, and $\Diamond\varphi$.
We also assume that $\wedge$ and $\vee$ are associative and commutative operators and
consider, e.g., $(p \vee (q \vee r))$ and $(r \vee (q \vee p))$ to be identical formulae. We
often omit parentheses if this does not cause confusion. By $var(\varphi)$ we denote the
set of all propositional symbols occurring in $\varphi$. This function straightforwardly
extends to finite sets of modal formulae. A modal axiom (schema) is a modal
formula $\psi$ representing the set of all instances of $\psi$.

A literal is either a propositional symbol or its negation; the set of literals is
denoted by $L$. We denote by $\neg l$ the complement of the literal $l \in L$, that is, $\neg l$
denotes $\neg p$ if $l$ is the propositional symbol $p$, and $\neg l$ denotes $p$ if $l$ is the literal
$\neg p$. A modal literal is either $\Box l$ or $\Diamond l$, where $l \in L$.

A (normal) modal logic is a set of modal formulae which includes all
propositional tautologies and the axiom schema $\Box(\varphi \to \psi) \to (\Box\varphi \to \Box\psi)$,
called the axiom K, and which is closed under modus ponens (if $\vdash \varphi$ and
$\vdash \varphi \to \psi$ then $\vdash \psi$) and the rule of necessitation (if $\vdash \varphi$ then
$\vdash \Box\varphi$).

K is the weakest modal logic, that is, the logic given by the smallest set
of modal formulae constituting a normal modal logic. By KX we denote an
extension of K by a set X of axioms.

The standard semantics of modal logics is the Kripke semantics or possible
world semantics. A Kripke frame $F$ is an ordered pair $\langle W, R \rangle$ where $W$ is a
non-empty set of worlds and $R$ is a binary (accessibility) relation over $W$. A
Kripke structure $M$ over $P$ is an ordered pair $\langle F, V \rangle$ where $F$ is a Kripke frame
and the valuation $V$ is a function mapping each propositional symbol $p$ in $P$ to
a subset $V(p)$ of $W$. We say $M = \langle F, V \rangle$ is based on the frame $F$. A rooted
Kripke structure is an ordered pair $\langle M, w_0 \rangle$ with $w_0 \in W$. To simplify notation,
in the following we write $\langle W, R, V \rangle$ and $\langle W, R, V, w_0 \rangle$ instead of
$\langle \langle W, R \rangle, V \rangle$ and $\langle \langle \langle W, R \rangle, V \rangle, w_0 \rangle$, respectively.

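Satisfaction (or truth) of a formula at a world $w$ of a Kripke structure
$M = \langle W, R, V \rangle$ is inductively defined by:

- $\langle M, w \rangle \models$ true; $\langle M, w \rangle \not\models$ false;
- $\langle M, w \rangle \models p$ iff $w \in V(p)$, where $p \in P$;
- $\langle M, w \rangle \models \neg\varphi$ iff $\langle M, w \rangle \not\models \varphi$;
- $\langle M, w \rangle \models (\varphi \wedge \psi)$ iff $\langle M, w \rangle \models \varphi$ and $\langle M, w \rangle \models \psi$;
- $\langle M, w \rangle \models (\varphi \vee \psi)$ iff $\langle M, w \rangle \models \varphi$ or $\langle M, w \rangle \models \psi$;
- $\langle M, w \rangle \models (\varphi \to \psi)$ iff $\langle M, w \rangle \not\models \varphi$ or $\langle M, w \rangle \models \psi$;
- $\langle M, w \rangle \models \Box\varphi$ iff for every $v$, $w\,R\,v$ implies $\langle M, v \rangle \models \varphi$;
- $\langle M, w \rangle \models \Diamond\varphi$ iff there is $v$ with $w\,R\,v$ and $\langle M, v \rangle \models \varphi$.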


If $\langle M, w \rangle \models \varphi$ holds then $M$ is a model of $\varphi$, $\varphi$ is true at $w$ in $M$ and
$M$ satisfies $\varphi$. A modal formula $\varphi$ is satisfiable iff there exists a Kripke
structure $M$ and a world $w$ in $M$ such that $\langle M, w \rangle \models \varphi$. A modal formula
$\varphi$ is globally true or valid in a Kripke structure $M$ if it is true at all worlds
of $M$; it is valid if it is valid in all Kripke structures.
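As an illustration of the satisfaction relation, the following minimal Python sketch evaluates a formula at a world of a finite Kripke structure. The tuple encoding of formulae and the names `holds` and `satisfied_somewhere` are assumptions made for this example only; they are not taken from any of the provers discussed here.

```python
# Hypothetical encoding (illustration only): a formula is True, False,
# a symbol such as "p", or a tuple ("not", f), ("and", f, g), ("or", f, g),
# ("imp", f, g), ("box", f), ("dia", f).

def holds(R, V, w, f):
    """Return True iff formula f is true at world w.

    R is a set of pairs (u, v) meaning u R v; V maps each propositional
    symbol to the set of worlds at which it is true.
    """
    if f is True or f is False:
        return f
    if isinstance(f, str):
        return w in V.get(f, set())
    op = f[0]
    if op == "not":
        return not holds(R, V, w, f[1])
    if op == "and":
        return holds(R, V, w, f[1]) and holds(R, V, w, f[2])
    if op == "or":
        return holds(R, V, w, f[1]) or holds(R, V, w, f[2])
    if op == "imp":
        return (not holds(R, V, w, f[1])) or holds(R, V, w, f[2])
    successors = [v for (u, v) in R if u == w]
    if op == "box":
        return all(holds(R, V, v, f[1]) for v in successors)
    if op == "dia":
        return any(holds(R, V, v, f[1]) for v in successors)
    raise ValueError(f"unknown connective: {op!r}")

def satisfied_somewhere(W, R, V, f):
    """f is satisfied by the structure iff it is true at some world."""
    return any(holds(R, V, w, f) for w in W)
```

Note that a box formula is vacuously true at a world without successors, which is why the sketch needs no special case for dead ends.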
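Name | Axiom | Frame property | First-order condition
D | $\Box\varphi \to \Diamond\varphi$ | Serial | $\forall v \exists w.\ v\,R\,w$
T | $\Box\varphi \to \varphi$ | Reflexive | $\forall w.\ w\,R\,w$
B | $\varphi \to \Box\Diamond\varphi$ | Symmetric | $\forall v w.\ v\,R\,w \to w\,R\,v$
4 | $\Box\varphi \to \Box\Box\varphi$ | Transitive | $\forall u v w.\ (u\,R\,v \wedge v\,R\,w) \to u\,R\,w$
5 | $\Diamond\varphi \to \Box\Diamond\varphi$ | Euclidean | $\forall u v w.\ (u\,R\,v \wedge u\,R\,w) \to v\,R\,w$

Table 1. Modal axioms and relational frame properties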


In the following we are interested in extensions of K with the axiom schemata
shown in Table 1. Each of these axiom schemata defines a class of Kripke frames
where the accessibility relation $R$ satisfies the first-order property stated in the
table. Given a normal modal logic L with corresponding class of frames $\mathfrak{F}$, we say
a modal formula $\varphi$ is L-satisfiable iff there exists a frame $F \in \mathfrak{F}$, a valuation $V$
and a world $w_0$ in $F$ such that $\langle F, V, w_0 \rangle \models \varphi$.
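For finite frames, the frame properties of Table 1 can be checked directly from their first-order definitions. The sketch below is only meant to make the five conditions concrete; the function names are hypothetical, and a frame is assumed to be given as a set `W` of worlds and a set `R` of pairs.

```python
# Illustrative checks of the frame properties in Table 1 on a finite frame.
# W is a set of worlds, R a set of pairs (u, v) meaning u R v.

def is_serial(W, R):
    return all(any((v, w) in R for w in W) for v in W)

def is_reflexive(W, R):
    return all((w, w) in R for w in W)

def is_symmetric(W, R):
    return all((w, v) in R for (v, w) in R)

def is_transitive(W, R):
    return all((u, w) in R
               for (u, v) in R for (v2, w) in R if v == v2)

def is_euclidean(W, R):
    return all((v, w) in R
               for (u, v) in R for (u2, w) in R if u == u2)
```

For instance, the two-world frame with `R = {(0, 1), (1, 1)}` is serial, transitive and Euclidean, but neither reflexive nor symmetric.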

A path rooted at $w_0$ of length $k$, $k \geq 0$, in a frame $F = \langle W, R \rangle$ is a sequence
$\vec{w} = \langle w_0, w_1, \ldots, w_k \rangle$ where for every $i$, $1 \leq i \leq k$, $w_{i-1}\,R\,w_i$. We say
that the path $\langle w_0, w_1, \ldots, w_k \rangle$ connects $w_0$ and $w_k$. For a path
$\vec{w} = \langle w_0, \ldots, w_k \rangle$ and world $w_{k+1}$ with $w_k\,R\,w_{k+1}$, $\vec{w} \circ w_{k+1}$ denotes
the path $\langle w_0, \ldots, w_k, w_{k+1} \rangle$. A path $\langle w_0 \rangle$ of length 0 is identified with its
root $w_0$. We denote the set of all paths rooted at a world $w_0$ in $F$ by $\vec{F}[w_0]$ and
the set of all paths by $\vec{F}$. The function $trm : \vec{F} \to W$ maps every path
$\vec{w} = \langle w_0, \ldots, w_k \rangle$ to its terminal world $w_k$ while the function
$len : \vec{F} \to \mathbb{N}$ maps every path $\vec{w} = \langle w_0, w_1, \ldots, w_k \rangle$ to its length $k$.
A rooted Kripke structure $M = \langle W, R, V, w_0 \rangle$ is a rooted tree Kripke structure
iff $R$ is a tree, that is, a directed acyclic connected graph where each node has at
most one predecessor, with root $w_0$. It is a rooted tree Kripke model of a modal
formula $\varphi$ iff $\langle W, R, V, w_0 \rangle \models \varphi$. In a rooted tree Kripke structure with root
$w_0$, for every world $w_k \in W$ there is exactly one path $\vec{w}$ connecting $w_0$ and $w_k$;
the modal level of $w_k$ (in $M$), denoted by $ml_M(w_k)$, is given by $len(\vec{w})$.
Let $F = \langle W, R \rangle$ be a Kripke frame with $w \in W$. The unravelling $F^u[w]$ of $F$
at $w$ is the frame $\langle \vec{W}, \vec{R} \rangle$ where:
- $\vec{W} = \vec{F}[w]$ is the set of all rooted paths at $w$ in $F$;
- for all $\vec{v}, \vec{w} \in \vec{W}$, if $\vec{w} = \vec{v} \circ w$ for some $w \in W$, then $\vec{v}\,\vec{R}\,\vec{w}$.
Let $F = \langle W, R \rangle$ and $F' = \langle W', R' \rangle$ be two Kripke frames. A function
$f : W \to W'$ is a p-morphism (or a bounded morphism) from $F$ to $F'$ if the
following holds:
- if $v\,R\,w$, then $f(v)\,R'\,f(w)$.
- if $f(u)\,R'\,w$, then there exists $v \in W$ s.t. $f(v) = w$ and $u\,R\,v$.
Analogously for Kripke models. For $F = \langle W, R \rangle$, $M' = \langle F, V', w_0 \rangle$, and
$M = \langle F^u[w_0], V, \langle w_0 \rangle \rangle$, the function $trm$ is a p-morphism from $M$ to $M'$.
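The unravelling is in general infinite, but a finite prefix of it is easy to compute. The following Python sketch builds all paths up to a given length and the induced relation between paths; the bound `max_len` and the function names are assumptions introduced for this illustration, with paths represented as tuples of worlds.

```python
# Bounded unravelling of a finite frame (illustration only).

def unravel(R, root, max_len):
    """Return (paths, path_relation) for all paths of length <= max_len."""
    paths = [(root,)]
    relation = set()
    frontier = [(root,)]
    for _ in range(max_len):
        next_frontier = []
        for path in frontier:
            for (u, v) in sorted(R):
                if u == path[-1]:
                    extended = path + (v,)       # the path "path o v"
                    relation.add((path, extended))
                    next_frontier.append(extended)
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths, relation

def trm(path):
    """Terminal world of a path."""
    return path[-1]

def length(path):
    """Length of a path, i.e. its number of steps."""
    return len(path) - 1
```

On the symmetric two-world frame `R = {(0, 1), (1, 0)}`, unravelling at world 0 up to length 2 yields the paths `(0,)`, `(0, 1)` and `(0, 1, 0)`, each with at most one predecessor, and `trm` maps every related pair of paths to a related pair of worlds.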
When considering local satisfiability, the following holds (see [8]):


Theorem 1. Let $\varphi$ be a modal formula. Then $\varphi$ is K-satisfiable iff there is a
finite rooted tree Kripke structure $M = \langle F, V, w_0 \rangle$ such that $\langle M, w_0 \rangle \models \varphi$.
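$\varphi \wedge \varphi \Rightarrow \varphi$ | $\varphi \wedge \neg\varphi \Rightarrow$ false | $\Box$true $\Rightarrow$ true | $\neg$true $\Rightarrow$ false | $\neg\neg\varphi \Rightarrow \varphi$
$\varphi \vee \varphi \Rightarrow \varphi$ | $\varphi \vee \neg\varphi \Rightarrow$ true | $\Diamond$false $\Rightarrow$ false | $\neg$false $\Rightarrow$ true |
$\varphi \wedge$ true $\Rightarrow \varphi$ | $\varphi \wedge$ false $\Rightarrow$ false | $\varphi \vee$ false $\Rightarrow \varphi$ | $\varphi \vee$ true $\Rightarrow$ true |

Table 2. Rewriting Rules for Simplification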


For the normal form transformation presented in the next section we assume
that any modal formula $\varphi$ has been simplified by exhaustively applying the
rewrite rules in Table 2 and is in Negation Normal Form (NNF), that is, a
formula where only propositional symbols are allowed in the scope of negations.
We say that such a formula is in simplified NNF.
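The preprocessing into simplified NNF can be sketched as two small recursive functions. This is an illustrative Python sketch under an assumed tuple encoding of formulae (symbols as strings, connectives as tagged tuples); it applies the rules of Table 2 bottom-up to binary conjunctions and disjunctions and does not model associativity or commutativity.

```python
# Illustration only: NNF conversion followed by the rewrite rules of Table 2.
# Formulae: True, False, "p", ("not", f), ("and", f, g), ("or", f, g),
# ("imp", f, g), ("box", f), ("dia", f).

def nnf(f, neg=False):
    """Push negations inwards; neg tells whether f occurs negated."""
    if f is True or f is False:
        return (not f) if neg else f
    if isinstance(f, str):
        return ("not", f) if neg else f
    op = f[0]
    if op == "not":
        return nnf(f[1], not neg)
    if op == "imp":                      # (a -> b) is (not a) or b
        return nnf(("or", ("not", f[1]), f[2]), neg)
    if op in ("and", "or"):
        dual = {"and": "or", "or": "and"}
        return (dual[op] if neg else op, nnf(f[1], neg), nnf(f[2], neg))
    if op in ("box", "dia"):
        dual = {"box": "dia", "dia": "box"}
        return (dual[op] if neg else op, nnf(f[1], neg))
    raise ValueError(f"unknown connective: {op!r}")

def complement(f):
    """Complement of a literal."""
    if isinstance(f, tuple) and f[0] == "not":
        return f[1]
    return ("not", f)

def simplify(f):
    """Apply the rules of Table 2 bottom-up to a formula in NNF."""
    if not isinstance(f, tuple) or f[0] == "not":
        return f
    op = f[0]
    if op in ("box", "dia"):
        g = simplify(f[1])
        if op == "box" and g is True:
            return True
        if op == "dia" and g is False:
            return False
        return (op, g)
    a, b = simplify(f[1]), simplify(f[2])
    unit, zero = (True, False) if op == "and" else (False, True)
    if a is zero or b is zero:
        return zero
    if a is unit:
        return b
    if b is unit:
        return a
    if a == b:
        return a
    if a == complement(b):
        return zero
    return (op, a, b)
```

For example, the negation of `("imp", ("box", "p"), ("dia", ("and", "q", False)))` is rewritten to `("box", "p")`.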


3 Layered Normal Form with Sets of Levels 


A formula to be tested for satisfiability is first transformed into a normal form
called Separated Normal Form with Sets of Modal Levels, $SNF_{sml}$, whose
language extends that of modal logic with labels consisting of sets of modal levels.
Informally, we write $S : \varphi$, where $S$ is a set of natural numbers, to denote that
a formula $\varphi$ is true at modal levels $ml \in S$. We write $\star : \varphi$ instead of $\mathbb{N} : \varphi$.

We introduce some notation that will be used in the following. Let
$S^+ = \{l + 1 \in \mathbb{N} \mid l \in S\}$, $S^- = \{l - 1 \in \mathbb{N} \mid l \in S\}$, and
$S^{\geq} = \{n \mid n \geq \min(S)\}$, where $\min(S)$ is the least element in $S$. Note that the
restriction of the elements being in $\mathbb{N}$ implies that $S^-$ cannot contain negative
numbers.
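The three operations on labels are simple set manipulations. The sketch below shows them in Python for finite sets of levels; since $S^{\geq}$ is infinite, the example truncates it at an explicit `bound`, which is an assumption of this illustration and not part of the definition.

```python
# Label operations on finite sets of modal levels (illustration only).

def plus(S):
    """S^+ : every level shifted one step away from the root."""
    return {l + 1 for l in S}

def minus(S):
    """S^- : every level shifted one step towards the root, within N."""
    return {l - 1 for l in S if l >= 1}

def geq(S, bound):
    """Finite approximation of S^>= : all levels from min(S) up to bound."""
    return set(range(min(S), bound + 1)) if S else set()
```

For $S = \{0, 2\}$ this gives $S^+ = \{1, 3\}$ and $S^- = \{1\}$, the level $-1$ being dropped.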

The labels in $SNF_{sml}$ work as a kind of weak universal operator, allowing us
to talk about formulae that are satisfied at all worlds in a given set of modal
levels. Formally, we restrict ourselves to rooted tree Kripke structures
$M = \langle W, R, V, w_0 \rangle$ and if $S$ is a set of modal levels, then by $M[S]$ we denote the
set of worlds that are at a modal level in $S$, that is,
$M[S] = \{w \in W \mid ml_M(w) \in S\}$. The satisfaction of labelled formulae in a rooted
tree Kripke structure $M$ is then defined as follows:


$M \models S : \varphi$ iff for every world $w \in M[S]$, we have $\langle M, w \rangle \models \varphi$.


If $M \models S : \varphi$, then we say that $S : \varphi$ holds in $M$. Note that if $S = \emptyset$, then
$M \models S : \varphi$ trivially holds. For a set $\Phi$ of labelled formulae, $M \models \Phi$ iff
$M \models S : \varphi$ for every $S : \varphi$ in $\Phi$, and we say $\Phi$ is K-satisfiable.

A labelled modal formula is then an $SNF_{sml}$ clause iff it is of one of the
following forms:


- Literal clause $S : \bigvee_{b=1}^{r} l_b$
- Positive modal clause $S : l' \to \Box l$
- Negative modal clause $S : l' \to \Diamond l$


where $S \subseteq \mathbb{N}$ and $l$, $l'$, $l_b$ are propositional literals with $1 \leq b \leq r$, $r \in \mathbb{N}$.
Positive and negative modal clauses are together known as modal clauses. We
regard a literal clause as a set of literals, that is, two clauses are the same if
they contain the same set of literals.

We assume that the set $P$ of propositional symbols is partitioned into two
infinite sets $Q$ and $T$ such that for every modal formula $\psi$ we have $var(\psi) \subseteq Q$
and there exists a propositional symbol $t_\psi \in T$ uniquely associated with $\psi$.

Given a modal formula $\varphi$ in simplified NNF and L $\in$ {K, KB, KD, KT, K4, K5},
then we can obtain a set $\Phi_L$ of clauses in $SNF_{sml}$ such that $\varphi$ is L-satisfiable iff
$\Phi_L$ is K-satisfiable as $\Phi_L = \{\{0\} : t_\varphi\} \cup \rho_L(\{0\} : t_\varphi \to \varphi)$, where $\rho_L$ is
defined as follows:


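$\rho_L(S : t \to \mathrm{true}) = \emptyset$
$\rho_L(S : t \to \mathrm{false}) = \{S : \neg t\}$
$\rho_L(S : t \to (\psi_1 \wedge \psi_2)) = \{S : \neg t \vee \eta(\psi_1),\ S : \neg t \vee \eta(\psi_2)\} \cup \delta_L(S, \psi_1) \cup \delta_L(S, \psi_2)$
$\rho_L(S : t \to \psi) = \{S : \neg t \vee \psi\}$
    if $\psi$ is a disjunction of literals
$\rho_L(S : t \to (\psi_1 \vee \psi_2)) = \{S : \neg t \vee \eta(\psi_1) \vee \eta(\psi_2)\} \cup \delta_L(S, \psi_1) \cup \delta_L(S, \psi_2)$
    if $\psi_1 \vee \psi_2$ is not a disjunction of literals
$\rho_L(S : t \to \Diamond\psi) = \{S : t \to \Diamond\eta(\psi)\} \cup \delta_L(S^+, \psi)$
$\rho_L(S : t \to \Box\psi) = P_L(S : t \to \Box\psi) \cup \Delta_L(S : t \to \Box\psi)$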


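where $\eta$ and $\delta_L$ are defined as follows:


$\eta(\psi) = \psi$ if $\psi$ is a literal, and $\eta(\psi) = t_\psi$ otherwise;
$\delta_L(S, \psi) = \emptyset$ if $\psi$ is a literal, and $\delta_L(S, \psi) = \rho_L(S : t_\psi \to \psi)$ otherwise;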


and functions $P_L$, $\Delta_L$ are defined as shown in Table 3. The function $\eta$ maps
a propositional literal $\psi$ to itself while it maps every other modal formula $\psi$
to a new propositional symbol $t_\psi \in T$ uniquely associated with $\psi$. We call $t_\psi$
the surrogate of $\psi$ or simply a surrogate. The functions $P_{KB}$ and $P_{K5}$ introduce
additional propositional symbols, called supplementary propositional symbols,
$t_{\Box\neg t_{\Box\psi}} \in T$ and $t_{\Diamond t_{\Box\psi}} \in T$, respectively, that do not correspond to
subformulae of the formula we are transforming.
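To make the recursive structure of the transformation concrete, here is a minimal Python sketch of $\rho_K$, that is, the case L = K on finite label sets. The encoding is an assumption of this example: surrogates $t_\psi$ are represented as `("t", psi)`, literal clauses as `(S, "lit", literals)` and modal clauses as `(S, "box" | "dia", l1, l2)`. It is a sketch of the definitions above, not the implementation used in KSP, and it treats disjunction as a binary connective.

```python
# Illustration only: rho_K for formulae in simplified NNF.
# Formulae: True, False, "p", ("not", "p"), ("and", f, g), ("or", f, g),
# ("box", f), ("dia", f).

def is_lit(f):
    if isinstance(f, str):
        return True
    return isinstance(f, tuple) and f[0] in ("t", "not")

def eta(f):
    """A literal is kept, any other formula is replaced by its surrogate."""
    return f if is_lit(f) else ("t", f)

def disjuncts(f):
    if isinstance(f, tuple) and f[0] == "or":
        return disjuncts(f[1]) + disjuncts(f[2])
    return [f]

def plus(S):
    return frozenset(l + 1 for l in S)

def delta(S, f):
    """Definitional clauses for the surrogate of f, if one was introduced."""
    return set() if is_lit(f) else rho(S, ("t", f), f)

def rho(S, t, f):
    """Clauses for the labelled formula S : t -> f in K."""
    S = frozenset(S)
    neg_t = ("not", t)
    if f is True:
        return set()
    if f is False:
        return {(S, "lit", frozenset([neg_t]))}
    ds = disjuncts(f)
    if all(is_lit(d) for d in ds):
        return {(S, "lit", frozenset([neg_t] + ds))}
    op = f[0]
    if op == "and":
        return ({(S, "lit", frozenset([neg_t, eta(f[1])])),
                 (S, "lit", frozenset([neg_t, eta(f[2])]))}
                | delta(S, f[1]) | delta(S, f[2]))
    if op == "or":
        return ({(S, "lit", frozenset([neg_t, eta(f[1]), eta(f[2])]))}
                | delta(S, f[1]) | delta(S, f[2]))
    if op in ("box", "dia"):
        # In K both modal cases define the argument at the levels in S^+.
        return {(S, op, t, eta(f[1]))} | delta(plus(S), f[1])
    raise ValueError(f"unexpected formula: {f!r}")

def phi_K(f):
    """The clause set {{0} : t_f} united with rho_K({0} : t_f -> f)."""
    t = ("t", f)
    return {(frozenset({0}), "lit", frozenset([t]))} | rho({0}, t, f)
```

For the formula $\Diamond(p \wedge \Box q)$ the sketch produces five clauses: the unit clause for the top surrogate at level 0, one negative modal clause at level 0, two literal clauses at level 1 and one positive modal clause at level 1.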

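Intuitively, $P_{KB}$ is based on the following consideration: Take a world $w$ in
a Kripke structure $M$ with a symmetric accessibility relation $R$. If there exists
a world $v$ with $w\,R\,v$ such that $\langle M, v \rangle \models \Box\psi$, then $\langle M, w \rangle \models \psi$. Now, take the
contrapositive of that statement: If $\langle M, w \rangle \not\models \psi$, then for every world $v$ with
$w\,R\,v$, $\langle M, v \rangle \not\models \Box\psi$. Equivalently, $\langle M, w \rangle \models \psi$ or $\langle M, w \rangle \models \Box\neg\Box\psi$. This is
expressed by the formula $\eta(\psi) \vee t_{\Box\neg t_{\Box\psi}}$. For $P_{K5}$, the formula
$t_{\Diamond t_{\Box\psi}} \to \Box t_{\Diamond t_{\Box\psi}}$ expresses an instance of axiom schema 5,
$\Diamond\varphi \to \Box\Diamond\varphi$, with $\varphi = \Box\psi$, i.e., $\Diamond\Box\psi \to \Box\Diamond\Box\psi$. The
contrapositive of axiom schema 5 is $\Diamond\Box\varphi \to \Box\varphi$, equivalent to
$\neg\Diamond\Box\varphi \vee \Box\varphi$. For $\varphi = \psi$ this is expressed by the formula
$\neg t_{\Diamond t_{\Box\psi}} \vee t_{\Box\psi}$. For the formula $\neg t_{\Diamond t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$,
consider $\neg\Diamond\Box\psi$. By duality of $\Box$ and $\Diamond$, this is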
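L | $P_L(S : t_{\Box\psi} \to \Box\psi)$ | $\Delta_L(S : t_{\Box\psi} \to \Box\psi)$
K | $S : t_{\Box\psi} \to \Box\eta(\psi)$ | $\delta_L(S^+, \psi)$
KT | $S : t_{\Box\psi} \to \Box\eta(\psi)$, $S : \neg t_{\Box\psi} \vee \eta(\psi)$ | $\delta_L(S \cup S^+, \psi)$
KD | $S : t_{\Box\psi} \to \Box\eta(\psi)$, $S : t_{\Box\psi} \to \Diamond\eta(\psi)$ | $\delta_L(S^+, \psi)$
KB | $S : t_{\Box\psi} \to \Box\eta(\psi)$, $S^- : \eta(\psi) \vee t_{\Box\neg t_{\Box\psi}}$, $S^- : t_{\Box\neg t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$ | $\delta_L(S^- \cup S^+, \psi)$
K4 | $S^{\geq} : t_{\Box\psi} \to \Box\eta(\psi)$, $S^{\geq} : t_{\Box\psi} \to \Box t_{\Box\psi}$ | $\delta_L((S^+)^{\geq}, \psi)$
K5 | $\star : t_{\Box\psi} \to \Box\eta(\psi)$, $\star : \neg t_{\Diamond t_{\Box\psi}} \vee t_{\Box\psi}$, $\star : t_{\Diamond t_{\Box\psi}} \to \Diamond t_{\Box\psi}$, $\star : \neg t_{\Diamond t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$, $\star : t_{\Diamond t_{\Box\psi}} \to \Box t_{\Diamond t_{\Box\psi}}$ | $\delta_L(\star, \psi)$

Table 3. Transformation of $\Box$-formulae in modal logic L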


equivalent to $\neg\neg\Box\neg\Box\psi$ and $\Box\neg\Box\psi$. So, $\neg\Diamond\Box\psi \to \Box\neg\Box\psi$ holds in every
normal modal logic, not only K5. The remaining labelled formulae introduced by
$P_{KB}$ and $P_{K5}$ ensure that supplementary propositional symbols are defined. For
the remaining logics the additional clauses are also based directly on the axiom
schemata.

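To simplify presentation in the following, we define a function $\eta_f$ as follows:


$\eta_f(\psi_1 \wedge \psi_2) = \eta(\psi_1) \wedge \eta(\psi_2)$        $\eta_f(\psi_1 \vee \psi_2) = \eta(\psi_1) \vee \eta(\psi_2)$
$\eta_f(\Box\psi) = \Box\eta(\psi)$        $\eta_f(\Diamond\psi) = \Diamond\eta(\psi)$


and we treat the two clauses $S : \neg t_{\psi_1 \wedge \psi_2} \vee \eta(\psi_1)$ and
$S : \neg t_{\psi_1 \wedge \psi_2} \vee \eta(\psi_2)$ resulting from the normal form transformation of
$\psi_1 \wedge \psi_2$ as a single ‘clause’ $S : \neg t_{\psi_1 \wedge \psi_2} \vee \eta_f(\psi_1 \wedge \psi_2)$. We also
interchangeably write $S : \neg t_{\Box\psi} \vee \eta_f(\Box\psi)$ for $S : t_{\Box\psi} \to \eta_f(\Box\psi)$ and,
analogously, $S : \neg t_{\Diamond\psi} \vee \eta_f(\Diamond\psi)$ for $S : t_{\Diamond\psi} \to \eta_f(\Diamond\psi)$.
We then call any clause of the form $S : \neg t_\psi \vee \eta_f(\psi)$ a definitional clause.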


Definition 1. Let $\Phi$ be a set of $SNF_{sml}$ clauses. We say $t_\psi \in T$ occurs at level
$ml$ in $\Phi$ iff either

(a) there exists a clause $S : \vartheta$ in $\Phi$ with $ml \in S$ such that $\vartheta$ is a propositional
formula and $t_\psi$ occurs positively in $\vartheta$, or

(b) there exists a clause $S : t_{\Box\psi} \to \Box t_\psi$ in $\Phi$ with $ml - 1 \in S$, or

(c) there exists a clause $S : t_{\Diamond\psi} \to \Diamond t_\psi$ in $\Phi$ with $ml - 1 \in S$.


Definition 2. Let $\Phi$ be a set of $SNF_{sml}$ clauses. Then $\Phi$ is definition-complete
iff for every $t_\psi \in T$ and every level $ml$, if $t_\psi$ occurs at level $ml$ in $\Phi$ then there
exists a clause $S : \neg t_\psi \vee \eta_f(\psi)$ in $\Phi$ with $ml \in S$.


Theorem 2. Let L $\in$ {K, KB, KD, KT, K4, K5}. Then
$\Phi_L = \{\{0\} : t_\varphi\} \cup \rho_L(\{0\} : t_\varphi \to \varphi)$ is definition-complete.

Proof. By induction over the computation of $\Phi_L$. It is straightforward to see that
the transformation of labelled formulae $S : t \to (\psi_1 \wedge \psi_2)$ and $S : t \to (\psi_1 \vee \psi_2)$
only introduces surrogates at levels in $S$ and $\delta_L$ then adds definitional clauses
for those surrogates. The transformation of a labelled formula $S : t_{\Diamond\psi} \to \Diamond\psi$ may
introduce a surrogate at levels in $S^+$ and $\delta_L(S^+, \psi)$ then adds definitional clauses
for those surrogates. The transformation of a labelled formula $S : t_{\Box\psi} \to \Box\psi$
depends on the logic L. We can see that for every level at which a new surrogate
occurs in $P_L(S : t_{\Box\psi} \to \Box\psi)$, $\Delta_L(S : t_{\Box\psi} \to \Box\psi)$ contains a definitional
clause for it at that level.


4 Correctness 


Due to space constraints we only prove the correctness of the transformation for 
KB. We first state several lemmata that are used in the correctness proofs for 
all logics. 


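Lemma 1. Let $\Phi$ be a set of definitional clauses such that every $t_\psi$ occurring
in $\Phi$ is an element of $T$ and all other propositional symbols occurring in $\Phi$ are
in $Q$. Let $M = \langle W, R, V, w_0 \rangle$ be a rooted Kripke structure. Let $\langle \vec{W}, \vec{R} \rangle$ be the
unravelling of $\langle W, R \rangle$ at $w_0$. Let $\vec{M}_\Sigma = \langle \vec{W}, \vec{R}, V_\Sigma, \langle w_0 \rangle \rangle$ be a Kripke structure
such that


- $V_\Sigma(p) = \{\vec{w} \in \vec{W} \mid trm(\vec{w}) \in V(p)\}$ for every propositional symbol $p \in Q$, and
- $V_\Sigma(t_\psi) = \{\vec{w} \in \vec{W} \mid \langle \vec{M}_\Sigma, \vec{w} \rangle \models \psi\}$ for every surrogate $t_\psi \in T \cap var(\Phi)$.


Then $\vec{M}_\Sigma \models \Phi$.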


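Lemma 2. Let $\varphi$ be an L-satisfiable modal formula in simplified NNF where
L is a normal modal logic and let $\Phi_K = \{\{0\} : t_\varphi\} \cup \rho_K(\{0\} : t_\varphi \to \varphi)$. Let
$M = \langle W, R, V, w_0 \rangle$ be a rooted K model of $\varphi$. Let $\langle \vec{W}, \vec{R} \rangle$ be the unravelling of
$\langle W, R \rangle$ at $w_0$. Let $\vec{M} = \langle \vec{W}, \vec{R}, \vec{V}, \langle w_0 \rangle \rangle$ be a Kripke structure such that


- $\vec{V}(p) = \{\vec{w} \in \vec{W} \mid trm(\vec{w}) \in V(p)\}$ for every propositional symbol $p \in var(\varphi)$,
  and
- $\vec{V}(t_\psi) = \{\vec{w} \in \vec{W} \mid \langle \vec{M}, \vec{w} \rangle \models \psi\}$ for every surrogate $t_\psi \in T \cap var(\Phi_K)$.


Then $\vec{M} \models \Phi_K$.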


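Lemma 3. Let $M = \langle W, R, V, w_0 \rangle$ be a rooted Kripke structure. Let $\langle \vec{W}, \vec{R} \rangle$ be
the unravelling of $\langle W, R \rangle$ at $w_0$. Let $\vec{M}_\Sigma = \langle \vec{W}, \vec{R}, V_\Sigma, \langle w_0 \rangle \rangle$ where
$V_\Sigma(p) = \{\vec{w} \in \vec{W} \mid trm(\vec{w}) \in V(p)\}$ for every propositional symbol $p \in Q$.

Then for every modal formula $\psi$ over $Q$ and for every world $\vec{w} \in \vec{W}$,
$\langle \vec{M}_\Sigma, \vec{w} \rangle \models \psi$ iff $\langle M, trm(\vec{w}) \rangle \models \psi$.


Lemma 4. Let $\varphi$ be a modal formula in simplified NNF. Let
$\Phi_K = \{\{0\} : t_\varphi\} \cup \rho_K(\{0\} : t_\varphi \to \varphi)$. Let $\Phi$ with $\Phi_K \subseteq \Phi$ be a
definition-complete set of $SNF_{sml}$ clauses, let $M = \langle W, R, V, w_0 \rangle$ be a tree K model of
$\Phi$ and let $M' = \langle W, R', V, w_0 \rangle$ be such that

(4a) $R \subseteq R'$;

(4b) for every modal clause $S : t_{\Box\psi} \to \Box\eta(\psi)$ in $\Phi$ and every world $w \in M[S]$,
$\langle M', w \rangle \models t_{\Box\psi} \to \Box\eta(\psi)$;

(4c) for every modal clause $S : t_{\Box\psi} \to \Box t_\psi$ in $\Phi$ and all worlds $v, w \in W$, if
(i) $w \in M[S]$ and (ii) $w\,R'\,v$ then (iii) there exists a clause $S' : \neg t_\psi \vee \eta_f(\psi)$
in $\Phi$ with $v \in M[S']$.

Then $\langle M', w_0 \rangle \models \varphi$.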


Theorems 3 and 4 now state the correctness of our transformation for KB. 


Theorem 3. Let $\varphi$ be a modal formula in simplified NNF. Let
$\Phi_B = \{\{0\} : t_\varphi\} \cup \rho_{KB}(\{0\} : t_\varphi \to \varphi)$. If $\varphi$ is KB-satisfiable, then $\Phi_B$ is
K-satisfiable.


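Proof. The main idea is to show that given a rooted KB model of $\varphi$, a small
variation of its unravelling is a rooted tree K model of $\Phi_B$.
Let $M = \langle W, R, V, w_0 \rangle$ be a rooted KB model of $\varphi$ with $\langle M, w_0 \rangle \models \varphi$ and
symmetric relation $R$. Let $\langle \vec{W}, \vec{R} \rangle$ be the unravelling of $\langle W, R \rangle$ at $w_0$. Let
$\vec{M}_B = \langle \vec{W}, \vec{R}, V_B, \langle w_0 \rangle \rangle$ where
- $V_B(p) = \{\vec{w} \in \vec{W} \mid trm(\vec{w}) \in V(p)\}$ for every propositional symbol $p \in var(\varphi)$,
- $V_B(t_\psi) = \{\vec{w} \in \vec{W} \mid \langle \vec{M}_B, \vec{w} \rangle \models \psi\}$ for every surrogate
  $t_\psi \in var(\Phi_B) \setminus var(\varphi)$ introduced by rewriting, and
- $V_B(t_{\Box\neg t_{\Box\psi}}) = \{\vec{w} \in \vec{W} \mid \langle \vec{M}_B, \vec{w} \rangle \models \Box\neg\Box\psi\}$ for every
  supplementary propositional symbol $t_{\Box\neg t_{\Box\psi}}$ introduced in the normal form
  transformation of a labelled formula $S : t_{\Box\psi} \to \Box\psi$.


Note that $V_B$ is well-defined as for every surrogate $t_\psi \in T$, $\psi$ only contains
propositional symbols in $Q$. Let $\Phi_K = \{\{0\} : t_\varphi\} \cup \rho_K(\{0\} : t_\varphi \to \varphi)$.

We now consider the clauses occurring in $\Phi_B$ and show that they hold in $\vec{M}_B$.
By Lemma 2 it follows that $\vec{M}_B \models \Phi_K$. Also, all definitional clauses in
$\Phi_B \setminus \Phi_K$ are true in $\vec{M}_B$ by Lemma 1.

Next consider clauses of the form


(1) $S' : \eta(\psi) \vee t_{\Box\neg t_{\Box\psi}}$        (2) $S' : t_{\Box\neg t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$


where $t_{\Box\psi}$ is a surrogate for $\Box\psi$. These are not in $\Phi_K$. We show both are true
in $\vec{M}_B$. We do so by first considering that $t_{\Box\neg t_{\Box\psi}}$ is true at a world and then
that it is false.

Case (a): Let $\vec{w} \in \vec{M}_B[S']$ with $\langle \vec{M}_B, \vec{w} \rangle \models t_{\Box\neg t_{\Box\psi}}$. Clearly,
$\langle \vec{M}_B, \vec{w} \rangle \models \eta(\psi) \vee t_{\Box\neg t_{\Box\psi}}$. Also, by definition of $V_B$,
$\langle \vec{M}_B, \vec{w} \rangle \models \Box\neg\Box\psi$. So, for every $\vec{v} \in \vec{W}$ with $\vec{w}\,\vec{R}\,\vec{v}$,
$\langle \vec{M}_B, \vec{v} \rangle \not\models \Box\psi$. As $t_{\Box\psi}$ is a surrogate for $\Box\psi$, by definition of
$V_B$, $\vec{v} \notin V_B(t_{\Box\psi})$ and $\langle \vec{M}_B, \vec{v} \rangle \models \neg t_{\Box\psi}$. Thus,
$\langle \vec{M}_B, \vec{w} \rangle \models \Box\neg t_{\Box\psi}$ and, by the semantics of implication,
$\langle \vec{M}_B, \vec{w} \rangle \models t_{\Box\neg t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$.

Case (b): Let $\vec{w} \in \vec{M}_B[S']$ with $\langle \vec{M}_B, \vec{w} \rangle \not\models t_{\Box\neg t_{\Box\psi}}$. Clearly, by the
semantics of implication, $\langle \vec{M}_B, \vec{w} \rangle \models t_{\Box\neg t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$. Also, by
definition of $V_B$, $\vec{w} \notin V_B(t_{\Box\neg t_{\Box\psi}})$ implies $\langle \vec{M}_B, \vec{w} \rangle \not\models \Box\neg\Box\psi$
which in turn implies $\langle \vec{M}_B, \vec{w} \rangle \models \Diamond\Box\psi$. So, there exists $\vec{v} \in \vec{W}$ with
$\vec{w}\,\vec{R}\,\vec{v}$ and $\langle \vec{M}_B, \vec{v} \rangle \models \Box\psi$. Since $trm$ is a p-morphism from
$\vec{M}_B$ to $M$, $trm(\vec{w})\,R\,trm(\vec{v})$. Since $R$ is symmetric, we also have
$trm(\vec{v})\,R\,trm(\vec{w})$ and by construction of $\vec{M}_B$, for $\vec{u} = \vec{v} \circ trm(\vec{w})$ we have
$\vec{v}\,\vec{R}\,\vec{u}$. Since $\langle \vec{M}_B, \vec{v} \rangle \models \Box\psi$, $\langle \vec{M}_B, \vec{u} \rangle \models \psi$. As $trm$ is a
p-morphism, $\langle M, trm(\vec{u}) \rangle \models \psi$, and since $trm(\vec{w}) = trm(\vec{u})$,
$\langle M, trm(\vec{w}) \rangle \models \psi$. By Lemma 3, from $\langle M, trm(\vec{w}) \rangle \models \psi$ we obtain
$\langle \vec{M}_B, \vec{w} \rangle \models \psi$. If $\psi$ is a literal, then $\eta(\psi) = \psi$ and
$\langle \vec{M}_B, \vec{w} \rangle \models \eta(\psi)$. If $\psi$ is not a literal, then $\eta(\psi) = t_\psi$ and from
$\langle \vec{M}_B, \vec{w} \rangle \models \psi$, by definition of $V_B$, $\vec{w} \in V_B(t_\psi)$ and
$\langle \vec{M}_B, \vec{w} \rangle \models t_\psi$. So, $\langle \vec{M}_B, \vec{w} \rangle \models \eta(\psi) \vee t_{\Box\neg t_{\Box\psi}}$.

Thus, in both cases, for arbitrary $\vec{w} \in \vec{M}_B[S']$, $\eta(\psi) \vee t_{\Box\neg t_{\Box\psi}}$ and
$t_{\Box\neg t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$ hold at $\vec{w}$ and therefore Clauses (1) and (2) are
true in $\vec{M}_B$.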


Theorem 4. Let $\varphi$ be a modal formula in simplified NNF. Let
$\Phi_B = \{\{0\} : t_\varphi\} \cup \rho_{KB}(\{0\} : t_\varphi \to \varphi)$. If $\Phi_B$ is K-satisfiable, then $\varphi$ is
KB-satisfiable.


Proof. The main idea is to show that given a rooted tree K model of $\Phi_B$, its
symmetric closure is a rooted KB model of $\varphi$.

Let $M = \langle W, R, V, w_0 \rangle$ be a rooted tree K model of $\Phi_B$. Let
$M^B = \langle W, R^B, V^B, w_0 \rangle$ be a structure such that

(a) $R^B$ is the symmetric closure of $R$, that is, $R^B$ is the smallest relation on $W$
such that $R \subseteq R^B$ and for every $v, w \in W$, $v\,R^B\,w$ implies $w\,R^B\,v$;

(b) $V^B(p) = V(p)$ for every propositional symbol $p$.

Let $\Phi_K = \{\{0\} : t_\varphi\} \cup \rho_K(\{0\} : t_\varphi \to \varphi)$. We show that $M^B$ and $\Phi_B$ satisfy
the three preconditions of Lemma 4. By Lemma 4 this in turn implies that
$M^B \models \varphi$.

- Condition (4a) holds as $R \subseteq R^B$.

- For Condition (4b) let (3) $S : t_{\Box\psi} \to \Box\eta(\psi)$ be a modal clause in $\Phi_B$.
  Then $\Phi_B$ also contains the additional clauses (4) $S^- : \eta(\psi) \vee t_{\Box\neg t_{\Box\psi}}$ and
  (5) $S^- : t_{\Box\neg t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$. Let $w \in M[S]$. We have to show that
  (6) $\langle M^B, w \rangle \models t_{\Box\psi} \to \Box\eta(\psi)$. Assume $\langle M^B, w \rangle \models t_{\Box\psi}$. As
  $V^B(t_{\Box\psi}) = V(t_{\Box\psi})$ this implies $\langle M, w \rangle \models t_{\Box\psi}$. Let $v \in W$ such that
  $w\,R^B\,v$.

  Case (a): Assume $w\,R\,v$. As $\langle M, w \rangle \models t_{\Box\psi}$ and
  $\langle M, w \rangle \models t_{\Box\psi} \to \Box\eta(\psi)$, we have $\langle M, w \rangle \models \Box\eta(\psi)$. As $w\,R\,v$,
  $\langle M, v \rangle \models \eta(\psi)$. As $\eta(\psi)$ is a literal and $V^B = V$ we obtain
  $\langle M^B, v \rangle \models \eta(\psi)$. So, $\langle M^B, w \rangle \models t_{\Box\psi} \to \Box\eta(\psi)$.

  Case (b): Assume $v$ is not reachable from $w$ via $R$. Then $w\,R^B\,v$ was introduced
  by the symmetric closure operation on $R$ and we must have $v\,R\,w$. That is,
  $v$ is an $R$-predecessor of $w$ and from $w \in M[S]$ it follows that $v \in M[S^-]$.
  So, (7) $\langle M, v \rangle \models \eta(\psi) \vee t_{\Box\neg t_{\Box\psi}}$ and
  (8) $\langle M, v \rangle \models t_{\Box\neg t_{\Box\psi}} \to \Box\neg t_{\Box\psi}$. From $v\,R\,w$,
  $\langle M, w \rangle \models t_{\Box\psi}$ and (8), it follows that $\langle M, v \rangle \models \neg t_{\Box\neg t_{\Box\psi}}$.
  This together with (7) implies $\langle M, v \rangle \models \eta(\psi)$. As $\eta(\psi)$ is a literal and
  $V^B = V$ we obtain $\langle M^B, v \rangle \models \eta(\psi)$. So, $\langle M^B, w \rangle \models t_{\Box\psi} \to \Box\eta(\psi)$.

  Case (a) and Case (b) together show that Property (6) holds.

- For Condition (4c) let (9) $S : t_{\Box\psi} \to \Box t_\psi$ be in $\Phi_B$, $v, w \in W$,
  $ml_M(w) = ml \in S$ (i.e., $w \in M[S]$) and $w\,R^B\,v$. We need to show that there
  exists a clause $S' : \neg t_\psi \vee \eta_f(\psi)$ in $\Phi_B$ with $v \in M[S']$.

  As in the previous case $w\,R^B\,v$ implies either $w\,R\,v$ or $v\,R\,w$. In the first case
  $ml_M(v) = ml + 1$ while in the second case $ml_M(v) = ml - 1$.
  As $\Phi_B$ contains Clause (9), $t_\psi$ occurs at level $ml + 1$ in $\Phi_B$. By definition of
  $\rho_{KB}$, $\Phi_B$ also contains the clause (10) $S^- : t_\psi \vee t_{\Box\neg t_{\Box\psi}}$. As $ml \in S$,
  $ml - 1 \in S^-$ and therefore $t_\psi$ also occurs at level $ml - 1$ in $\Phi_B$. By Theorem 2,
  $\Phi_B$ is definition-complete, so there must be a clause $S' : \neg t_\psi \vee \eta_f(\psi)$ in
  $\Phi_B$ such that $ml + 1$ and $ml - 1$ are in $S'$.
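The key observation behind Conditions (4b) and (4c), namely that in the symmetric closure of a tree every neighbour of a world at level $ml$ lies at level $ml + 1$ or $ml - 1$, can be checked on small examples. The Python sketch below is an illustration with hypothetical function names; it assumes the input relation `R` is a finite tree given as a set of pairs.

```python
# Illustration only: symmetric closure of a tree-shaped relation and the
# modal levels of its worlds.

def symmetric_closure(R):
    """Smallest symmetric relation containing R."""
    return set(R) | {(v, u) for (u, v) in R}

def modal_levels(R, root):
    """Map every world reachable from root to its distance from root."""
    level = {root: 0}
    frontier = [root]
    while frontier:
        next_frontier = []
        for u in frontier:
            for (a, b) in R:
                if a == u and b not in level:
                    level[b] = level[u] + 1
                    next_frontier.append(b)
        frontier = next_frontier
    return level
```

For the tree `R = {(0, 1), (1, 2), (0, 3)}` rooted at 0, every pair in the symmetric closure connects worlds whose levels differ by exactly one, which is why definitions of surrogates are only needed at the levels in $S^- \cup S^+$.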


Theorem 5. Let $\varphi$ be a modal formula in simplified NNF, L $\in$ {K, KB, KD, KT,
K4, K5}, and $\Phi_L = \{\{0\} : t_\varphi\} \cup \rho_L(\{0\} : t_\varphi \to \varphi)$. Then $\varphi$ is L-satisfiable iff
$\Phi_L$ is K-satisfiable.


5 Comparison With Related Work 


The approaches most closely related to ours are Kracht’s reductions of normal 
modal logics to basic modal logic [11,12], the global modal resolution calcu- 
lus [14], and Schmidt and Hustadt’s axiomatic translation principle for transla- 
tions of normal modal logics to first-order logic [24]. 

The first significant difference to our approach is that Kracht’s reductions
and the axiomatic translation exclude the modal operator $\Diamond$ from the language
and only consider the modal operator $\Box$.

In order to present Kracht’s approach we need some additional notions.
Let $sf(\varphi)$, $dg(\varphi)$, and $|S|$ denote the set of all subformulae of $\varphi$, the maximum
nesting of modal operators in $\varphi$, and the cardinality of the set $S$, respectively.
Let O'y = O° = O<ty = y, OST = (Y A OOS), OPM = OO"Y, and 
Orttyy = OO"d. We can then define a reduction function p for a normal modal 
logic L in {KB, KD, KT, K4} as follows: 


p ADSIS PK (p), for L = K4 


K 
pL) = 
p \O<d%)+1 PK (p) otherwise 


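where $P^K_{KB}(\varphi) = \{\neg\psi \to \Box\neg\Box\psi \mid \Box\psi \in sf(\varphi)\}$        $P^K_{KD}(\varphi) = \{\neg\Box\mathrm{false}\}$
      $P^K_{K4}(\varphi) = \{\Box\psi \to \Box\Box\psi \mid \Box\psi \in sf(\varphi)\}$        $P^K_{KT}(\varphi) = \{\Box\psi \to \psi \mid \Box\psi \in sf(\varphi)\}$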
Kracht shows that $\varphi$ is L-satisfiable iff $\rho^K_L(\varphi)$ is K-satisfiable. There are three
differences to our approach. First, $P^K_L(\varphi)$ will include an axiom instance for
every occurrence of a subformula $\neg\Box\psi$, equivalent to $\Diamond\neg\psi$, in $\varphi$. In contrast,
our approach requires no logic specific treatment of such subformulae. Second,
the use of $\Box^{<n} P^K_L(\varphi)$ in $\rho^K_L$ means that the axiom instance is available at every
modal level. This means, for example, that for $\varphi_1 = \Diamond^{100}(\neg p \wedge \Box p)$, the formula
$\rho^K_{KT}(\varphi_1)$ contains the axiom instance $\Box p \to p$ over 100 times, although it is only
required at the level at which $\Box p$ occurs. Third, this is further compounded if
the formula $\psi$ in $\Box\psi$ is itself a complex formula. We try to avoid that by using
a surrogate propositional symbol $t_\psi$ instead, but this will only have a positive
effect if the definitional clauses for $t_\psi$ do not have to be repeated.

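The global modal resolution (GMR) calculus operates on $SNF_g$ clauses, that
is, clauses of the form


$\boxed{*}(\mathrm{start} \to \bigvee_{b=1}^{r} l_b)$    $\boxed{*}(\mathrm{true} \to \bigvee_{b=1}^{r} l_b)$    $\boxed{*}(l' \to \Box l)$    $\boxed{*}(l' \to \neg\Box l)$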
[EUC1] *(1, 3 70-1) [EUC2] *(f > Bly) 
* (true >al V toi) * (toi > al) *(to1 > l2) * (toi >n =l) 
*(ato1 —> =l) * (toi —> toi) *(ato1 => =l) * (toi -> tor) 


Table 4. Inference rules in [14] for K5 (EUC1 and EUC2). 


where $l$, $l'$, $l_b$ are propositional literals with $1 \leq b \leq r$, $r \in \mathbb{N}$, and $\boxed{*}$ is the
universal operator. The calculus has specific inference rules for normal modal
logics such as KB, KD, KT, K4, K5. Table 4 shows the two additional rules for K5,
the only logic for which there are rules for both $\Box$ and $\neg\Box\neg$, i.e., $\Diamond$. These
inference rules can be seen to perform an ‘on-the-fly’ computation of a
transformation. Note that the clauses produced by $P_{K5}$ differ from those produced by
GMR for K5. Implicitly, our results here also show that it should be possible to
eliminate EUC1 from the GMR calculus.

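For the axiomatic translation, we only present the function $P^{RS}_L$ that
computes the logic dependent first-order clausal formulae that are part of the overall
translation.


$P^{RS}_{KB}(\varphi) = \{\forall x y (\neg Q_{\Box\psi}(y) \vee \neg R(x, y) \vee Q_\psi(x)) \mid \Box\psi \in sf(\varphi)\}$
$P^{RS}_{KD}(\varphi) = \{\forall x (\neg Q_{\Box\psi}(x) \vee Q_{\neg\Box\neg\psi}(x)) \mid \Box\psi \in sf(\varphi)\}$
$P^{RS}_{KT}(\varphi) = \{\forall x (\neg Q_{\Box\psi}(x) \vee Q_\psi(x)) \mid \Box\psi \in sf(\varphi)\}$
$P^{RS}_{K4}(\varphi) = \{\forall x y (\neg Q_{\Box\psi}(x) \vee \neg R(x, y) \vee Q_{\Box\psi}(y)) \mid \Box\psi \in sf(\varphi)\}$
$P^{RS}_{K5}(\varphi) = \{\forall x y (\neg Q_{\Box\psi}(y) \vee \neg R(x, y) \vee Q_{\Box\psi}(x)),$
    $\forall x y (\neg Q_{\Box\neg\Box\psi}(y) \vee \neg R(x, y) \vee Q_{\Box\neg\Box\psi}(x)) \mid \Box\psi \in sf(\varphi)\}$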


The predicate symbols Qy, correspond to our surrogate symbols t. The clausal 
formulae used in the treatment of KT and K4 are translations of the SNF, 
clauses we use (or vice versa). KB and K5 are handled in a different way as the
first-order clausal formulae refer directly to the accessibility relation and can
therefore more easily express the transfer of information to a predecessor world. The
universal quantification over worlds also means that the constraints expressed 
by the formulae hold at all modal levels without the need of any repetition. 

In Section 6 we will also use the relational and semi-functional translation 
of modal logics to first-order logic combined with structural transformation to 
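clause normal form. In both approaches $\Box\psi$ is translated as
$\forall x y(\neg Q_{\Box\psi}(x) \vee \neg R(x,y) \vee Q_\psi(y))$, while $\Diamond\psi$ becomes
$\forall x \exists y(\neg Q_{\Diamond\psi}(x) \vee R(x,y))$ and $\forall x \exists \alpha(\neg Q_{\Diamond\psi}(x) \vee R(x,[x\alpha]))$
in the relational and semi-functional translation, respectively. Then,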
depending on the modal logics, further formulae representing the semantic prop- 
erties of the accessibility R are added. For the relational translation these will 
simply be the formulae in the fourth column of Table 1. The semi-functional 
translation uses collections of partial accessibility functions in addition to the
accessibility relation. A predicate def is used to represent on which worlds a partial
accessibility function is defined. For each modal logic there is then again a
background theory consisting of formulae over def and R that represents the
properties of the underlying accessibility relation, which is added to the translation
of a formula. For example, for K5 the background theory is: VayVa((-def (x) V 
def (y)) A (def (wo) V R(wo, [woa])) A (def (x) V adef(y) V R([xa], [y8]))), where 
Wo is a constant representing the root world in a rooted Kripke structure. 


6 Evaluation 


We have compared the performance of the following approaches: (i) the
combination of our reductions with the modal-layered resolution (MLR) calculus
for SNF$_{ml}$ clauses [15] implemented in the modal theorem prover KSP, with
three different refinements for resolution inferences on labelled propositional
clauses; (ii) the global modal resolution (GMR) calculus, also implemented in
KSP, with three different refinements for resolution inferences on propositional
clauses; (iii) the combinations of the relational and semi-functional translation of
modal logics to first-order logic with ordered first-order resolution implemented
in the first-order theorem prover SPASS. In total this gives us eight different
approaches to compare. The axiomatic translation is currently not implemented
in SPASS. Other provers, such as LEO-III [26], LWB [9], MleanCoP [21], do not
have built-in support for the full range of logics considered here. LoTREC 2.0 [7]
supports all the logics, but is not intended as an automatic theorem prover.

The modal-layered resolution calculus operates on SNF$_{ml}$ clauses, that is,
clauses of the form


$$ml : \bigvee_{b=1}^{r} l_b \qquad ml : l' \to \Box l \qquad ml : l' \to \Diamond l$$


where $ml \in \mathbb{N} \cup \{*\}$ and $l$, $l'$, $l_b$ are propositional literals with $1 \le b \le r$,
$r \in \mathbb{N}$. In the implementation of the reductions presented in Section 3, we take
an SNF$_{sml}$ clause $S : \varphi$ simply as an abbreviation of the set of SNF$_{ml}$ clauses
$\{ml : \varphi \mid ml \in S\}$. Note that this also means that we will have to repeat similar
resolution inferences for different modal levels.
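
The expansion of a clause labelled with a set of modal levels into one clause per level can be sketched as follows. This is only an illustration: the encoding of a clause as a (level, body) pair with an opaque string body and the helper name `expand_sml` are assumptions made here, not part of the implementation described in the text.

```python
from typing import FrozenSet, Set, Tuple

# Hypothetical encoding: a labelled clause is a (modal level, clause body) pair;
# the clause body is kept as an opaque string for illustration only.
LabelledClause = Tuple[int, str]


def expand_sml(levels: FrozenSet[int], body: str) -> Set[LabelledClause]:
    """Expand 'S : body' into one labelled clause 'ml : body' per level ml in S."""
    return {(ml, body) for ml in levels}


if __name__ == "__main__":
    for ml, body in sorted(expand_sml(frozenset({0, 1, 2}), "t1 -> box t2")):
        print(f"{ml} : {body}")
```

A set with k levels yields k copies of the same clause body, which is why similar inferences may be repeated once per level.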

KSP [13] implements the reductions presented in Section 3 as well as a normal
form transformation of modal formulae to sets of SNF, clauses. It implements 
both the MLR and the GMR calculus. Resolution inferences between (labelled) 
propositional clauses can either be unrestricted (cplain option), restricted by 
an ordering (cord option), that is, clauses can only be resolved on their maximal 
literals with respect to an ordering chosen by the prover in such a way to preserve 
completeness, restricted to negative resolution (cneg option), that is, one of the 
premises in an inference has to be a negative clause, or restricted to positive 
resolution. We do not include the last option in our evaluation as it typically 
performs worse. KSP also implements a range of simplification rules that are
applied to modal formulae before their transformation to normal form. Of those
we have enabled pure literal elimination (early_ple option), simplification using
the Box Normal Form [22] and Prenex Normal Form (bnfsimp and prenex
options) [17]. For clause processing, unit resolution and pure literal elimination
are enabled (unit, lhs_unit, and ple options).


Logic | Status | Total | KSP (GMR, | KSP (GMR, | KSP (GMR, | KSP (MLR, | KSP (MLR, | KSP (MLR, | SPASS (semi- | SPASS
      |        |       | cneg)     | cord)     | cplain)   | cneg)     | cord)     | cplain)   | functional)  | (relational)
K     | Sat    |   180 |       110 |       139 |        93 |       141 |       155 |       132 |           92 |           97
K     | Unsat  |   180 |       154 |       156 |       151 |       154 |       156 |       153 |          134 |          122
KD    | Sat    |   180 |       125 |       143 |       118 |       141 |       155 |       133 |          107 |          103
KD    | Unsat  |   180 |       154 |       156 |       151 |       154 |       156 |       153 |          136 |          130
KT    | Sat    |   100 |        53 |        60 |        37 |        46 |        56 |        26 |           47 |           39
KT    | Unsat  |   260 |       233 |       236 |       225 |       230 |       238 |       220 |          222 |          199
KB    | Sat    |   122 |        28 |        35 |        41 |        49 |        89 |        22 |           31 |           23
KB    | Unsat  |   238 |       186 |       196 |       197 |       207 |       211 |       205 |          159 |          169
K4    | Sat    |   161 |        33 |        39 |        38 |        68 |       125 |        36 |            0 |            0
K4    | Unsat  |   199 |       124 |       112 |       146 |       168 |       165 |       163 |          109 |           35
K5    | Sat    |    60 |        14 |        10 |         9 |         7 |        10 |         4 |            7 |            0
K5    | Unsat  |   300 |       251 |       246 |       259 |       255 |       254 |       246 |          255 |          124
All   | Sat    |   803 |       363 |       426 |       336 |       452 |       590 |       353 |          284 |          262
All   | Unsat  |  1357 |      1102 |      1102 |      1129 |      1168 |      1180 |      1140 |         1015 |          779


Table 5. Experimental results on LWB benchmark collection

SPASS 3.9 [27,28] supports automated reasoning in extended modal logics, 
including all logics considered here, PDL-like modal logics as well as descrip- 
tion logics. It includes eight different translations of modal logics to first-order 
logic. In our evaluation we have used the relational translation and the semi- 
functional translation. For the local satisfiability problem in KB to K5, for the 
relational translation we have added the first-order frame properties given in 
Table 1 while for the semi-functional translation we have added the background 
theories devised by Nonnengart [20]. For the transformation to first-order clausal 
form, we have enabled renaming of quantified subformulae. The only inference
rules used are ordered resolution and ordered factoring; the reduction rules used
are condensing, backward subsumption and forward subsumption. For the
relational and semi-functional translation for K, KB, KD, and KT we thereby obtain
a decision procedure, while for the other logics we do not. For K4 and K5, the
fragment of first-order clausal logic corresponding to the semi-functional
translation of modal formulae and their background theories is decidable by ordered
resolution with selection [25]. However, the non-trivial ordering and selection
function required are not currently implemented in SPASS.


For our evaluation we have chosen the LWB basic modal logic benchmark 
collection [2], with 20 formulae in each of 18 parameterised classes. For K, all 
formulae in 9 classes are satisfiable while all formulae in the other 9 classes are 
unsatisfiable. In their negation normal form, 63% of modal operators are $\Box$ and
37% are $\Diamond$ operators. We have used the collection for each of the six logics. If a
formula is unsatisfiable in K then it remains unsatisfiable in the other five logics, 
while the opposite is not true. As we move to logics other than K, it is also no 
longer the case that all formulae in a class have the same satisfiability status. 

The third column in Table 5 indicates the total number of satisfiable and 
unsatisfiable formulae for each logic. In the last two lines of the table we sum 
up the results for all logics. The last eight columns in the table show how many
formulae each of the approaches was able to solve with a time limit of 100 CPU
seconds for each formula. Benchmarking was performed on a PC with an AMD 
Ryzen 5 5600X CPU @ 4.60GHz max and 32GB main memory using Fedora 
release 33 as operating system. 

As we can see, the new reductions combined with the modal-layered reso- 
lution (MLR) calculus and ordered resolution refinement (cord) perform best, 
achieving the highest number of solved formulae in 8 out of 12 individual cat- 
egories in the table, on two of those equal with the global modal resolution 
(GMR) calculus. On 3 categories, GMR outperforms MLR. On both satisfiable
and unsatisfiable formulae in K5 this can be seen as evidence that ‘on-the-fly’ 
transformation offers a (slight) advantage over our approach given that the ad- 
ditional clauses hold universally in both approaches. For SPASS we see a clear 
advantage of the semi-functional translation over the relational one, on both 
satisfiable and unsatisfiable formulae. 
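
The counts quoted in this paragraph can be recomputed from Table 5. The following sketch transcribes the table rows and recounts the categories; the column names, the dictionary layout and the helper names are choices made for this illustration only, and the K5/Sat entry for MLR with cneg is read as 7, the value consistent with the column total.

```python
# Solved-formula counts transcribed from Table 5 (one list per logic/status row).
COLUMNS = ["GMR cneg", "GMR cord", "GMR cplain",
           "MLR cneg", "MLR cord", "MLR cplain",
           "SPASS semi-functional", "SPASS relational"]

ROWS = {
    ("K", "Sat"):    [110, 139,  93, 141, 155, 132,  92,  97],
    ("K", "Unsat"):  [154, 156, 151, 154, 156, 153, 134, 122],
    ("KD", "Sat"):   [125, 143, 118, 141, 155, 133, 107, 103],
    ("KD", "Unsat"): [154, 156, 151, 154, 156, 153, 136, 130],
    ("KT", "Sat"):   [ 53,  60,  37,  46,  56,  26,  47,  39],
    ("KT", "Unsat"): [233, 236, 225, 230, 238, 220, 222, 199],
    ("KB", "Sat"):   [ 28,  35,  41,  49,  89,  22,  31,  23],
    ("KB", "Unsat"): [186, 196, 197, 207, 211, 205, 159, 169],
    ("K4", "Sat"):   [ 33,  39,  38,  68, 125,  36,   0,   0],
    ("K4", "Unsat"): [124, 112, 146, 168, 165, 163, 109,  35],
    ("K5", "Sat"):   [ 14,  10,   9,   7,  10,   4,   7,   0],
    ("K5", "Unsat"): [251, 246, 259, 255, 254, 246, 255, 124],
}


def best_columns(counts):
    """Names of all approaches that reach the maximum count in one row."""
    top = max(counts)
    return {COLUMNS[i] for i, n in enumerate(counts) if n == top}


def column_totals(status):
    """Per-approach totals over all logics for 'Sat' or 'Unsat'."""
    picked = [counts for (_, s), counts in ROWS.items() if s == status]
    return [sum(col) for col in zip(*picked)]


# Categories in which MLR with cord is (jointly) best.
mlr_cord_best = [row for row, counts in ROWS.items()
                 if "MLR cord" in best_columns(counts)]
# Of those, categories where a GMR variant reaches the same count.
shared_with_gmr = [row for row in mlr_cord_best
                   if any(c.startswith("GMR") for c in best_columns(ROWS[row]))]
# Categories where the best GMR variant beats the best MLR variant.
gmr_beats_mlr = [row for row, counts in ROWS.items()
                 if max(counts[:3]) > max(counts[3:6])]

if __name__ == "__main__":
    print(len(mlr_cord_best), len(shared_with_gmr), sorted(gmr_beats_mlr))
```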


7 Conclusion and Future Work 


We have presented new reductions of propositional modal logics KB, KD, KT, 
K4, K5 to Separated Normal Form with Sets of Modal Levels. We have shown 
experimentally that these reductions allow us to reason effectively in these logics. 

The obvious next step is to consider extensions of the basic modal logic K 
with combinations of the axioms B, D, T, 4, and 5. Unfortunately, a simple 
combination of the reductions for each of the axioms is not sufficient to obtain a 
satisfiability-preserving reduction for such modal logics. An example is the
simple formula =p A ©OOp which is KB4-unsatisfiable. If we define 


Pxepa(S : toy > wv) = Pxe(S : toy > p) U Pk4(S : toy > wv) 
Axes(S : toy > OW) = kea (x, Y), 


that is, Pkg, is the union of Pkg and Px, then the clause set obtained from 
{{0} : to} U pxea({0} : to > ap A OOO p) is K-satisfiable. The same issue also 
occurs in the axiomatic translation of modal logics to first-order logic where the 
translation for KB4 is not simply the combination of the translations for KB and 
K4 [24, Theorem 5.6]. We are currently exploring solutions to this problem. 

Regarding practical applications, it would be advantageous to have an
implementation of a calculus that operates directly on SNF$_{sml}$ clauses. This would
greatly reduce the number of inference steps performed on satisfiable formulae 
and simplify proof search in general. Again, such an implementation is future 
work. 
Abstract. Isabelle is a generic theorem prover with a fragment of higher-
order logic as a metalogic for defining object logics. Isabelle also provides 
proof terms. We formalize this metalogic and the language of proof terms 
in Isabelle/HOL, define an executable (but inefficient) proof term checker 
and prove its correctness w.r.t. the metalogic. We integrate the proof 
checker with Isabelle and run it on a range of logics and theories to 
check the correctness of all the proofs in those theories. 


1 Introduction 


One of the selling points of proof assistants is their trustworthiness. Yet in prac- 
tice soundness problems do come up in most proof assistants. Harrison [11] 
distinguishes errors in the logic and errors in the implementation (and cites ex- 
amples). Our work contributes to the solution of both problems for the proof 
assistant Isabelle [31]. Isabelle is a generic theorem prover: it implements M, a 
fragment of intuitionistic higher-order logic, as a metalogic for defining object 
logics. Its most developed object logic is HOL and the resulting proof assistant 
is called Isabelle/HOL [25,24]. The latter is the basis for our formalizations. 

Our first contribution is the first complete formalization of Isabelle’s meta- 
logic. Thus our work applies to all Isabelle object logics, e.g. not just HOL but 
also ZF. Of course Paulson [30] describes M precisely, but only on paper. More 
importantly, his description does not cover polymorphism and type classes, which 
were introduced later [26]. The published account of Isabelle’s proof terms [4] is 
also silent about type classes. Yet type classes are a significant complication (as, 
for example, Kunčar and Popescu [18] found out).

Our second contribution is a verified (against M) and executable checker for 
Isabelle’s proof terms. We have integrated the proof checker with Isabelle. Thus 
we can guarantee that every theorem whose proof our proof checker accepts is 
provable in our definition of M. So far we are able to check the correctness of 
moderately sized theories across the full range of logics implemented in Isabelle.

Although Isabelle follows the LCF-architecture (theorems that can only be 
manufactured by inference rules) it is based on an infrastructure optimized for 
performance. In particular, this includes multithreading, which is used in the
kernel and has once led to a soundness issue¹. Therefore we opt for the “certificate
checking” approach (via proof terms) instead of verifying the implementation. 

This is the first work that deals directly with what is implemented in Isabelle 
as opposed to a study of the metalogic that Isabelle is meant to implement. In- 
stead of reading the implementation you can now read and build on the more 
abstract formalization in this paper. The correspondence of the two can be es- 
tablished for each proof by running the proof checker. 

Our formalization reflects the ML implementation of Isabelle’s terms and 
types and some other data structures. Thus a few implementation choices are 
visible, e.g. De Bruijn indices. This is necessary because we want to integrate 
our proof checker as directly as possible with Isabelle, with as little unverified 
glue code as possible, for example no translation between De Bruijn indices and 
named variables. We refer to this as our intentional implementation bias. In prin- 
ciple, however, one could extend our formalization with different representations 
(e.g. named terms) and prove suitable isomorphisms. Our work is purely proof 
theoretic; semantics is out of scope. 

The formalization can be found in the Archive of Formal Proofs [28].


2 Related Work 


Harrison [11] was the first to verify some of HOL’s metatheory and an imple- 
mentation of a HOL kernel in HOL itself. Kumar et al. [13] formalized HOL 
including definition principles, proved its soundness and synthesized a verified 
kernel of a HOL prover down to the machine language level. Abrahamsson [2] 
verified a proof checker for the OpenTheory [12] proof exchange format for HOL. 

Wenzel [38] showed how to interpret type classes as predicates on types. We 
follow his approach of reflecting type classes in the logic but cannot remove them 
completely because of our intentional implementation bias (see above). Kunčar
and Popescu [15,16,17,18] focus on the subtleties of definition principles for HOL 
with overloading and prove that under certain conditions, type and constant 
definitions preserve consistency. Åman Pohjola et al. [1] formalize [15,18].

Adams [3] presents HOL Zero, a basic theorem prover for HOL that addresses 
the problem of how to ensure that parser and pretty-printer do not misrepresent 
formulas. 

Let us now move away from Isabelle and HOL. Sozeau et al. [36] present the 
first implementation of a type checker for the kernel of Coq that is proved correct 
in Coq with respect to a formal specification. Carneiro [6] has implemented a 
highly performant proof checker for a multi-sorted first order logic and is in the 
process of verifying it in its own logic. 

We formalize a logic with bound variables, and there is a large body of related 
work that deals with this issue (e.g. [37,21,7]) and a range of logics and systems 
with special support for handling bound variables (e.g. [33,34,35]). We found 
that De Bruijn indices worked reasonably well for us. 


¹ https://mailmanbroy.in.tum.de/pipermail/isabelle-dev/2016-December/007251.html
3 Preliminaries


Isabelle types are built from type variables, e.g. 'a, and (postfix) type
constructors, e.g. 'a list; the function type arrow is $\Rightarrow$. Isabelle also has a type class
system explained later. The notation $t :: \tau$ means that term $t$ has type $\tau$.
Isabelle/HOL provides types 'a set and 'a list of sets and lists of elements of type
'a. They come with the following vocabulary: function set (conversion from lists
to sets), (#) (list constructor), (@) (append), $|xs|$ (length of list $xs$), $xs \mathbin{!} i$ (the
$i$th element of $xs$ starting at 0), list-all2 $p\ [x_1, \ldots, x_m]\ [y_1, \ldots, y_n] = (m = n \wedge
p\ x_1\ y_1 \wedge \ldots \wedge p\ x_n\ y_n)$ and other self-explanatory notation.

The Field of a relation $r$ is the set of all $x$ such that $(x,\_)$ or $(\_,x)$ is in $r$.

There is also the predefined data type 


datatype 'a option = None | Some 'a


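The type $\tau_1 \rightharpoonup \tau_2$ abbreviates $\tau_1 \Rightarrow \tau_2$ option, i.e. partial functions, which we
call maps. Maps have a domain and a range:


$$\text{dom}\ m = \{a \mid m\ a \neq \text{None}\} \qquad \text{ran}\ m = \{b \mid \exists a.\ m\ a = \text{Some}\ b\}$$


Logical equivalence is written $=$ instead of $\longleftrightarrow$.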


4 Types and Terms 


A name is simply a string. Variables have type var; their inner structure is 
immaterial for the presentation of the logic. 

The logic has three layers: terms are classified by types as usual, but in 
addition types are classified by sorts. A sort is simply a set of class names. We 
discuss sorts in detail later. 

Types (typically denoted by T, U, ...) are defined like this: 


datatype typ = Ty name (typ list) | Tv var sort 


where Ty $\kappa$ $[T_1,\ldots,T_n]$ represents the Isabelle type $(T_1,\ldots,T_n)\ \kappa$ and Tv $a$ $S$
represents a type variable $a$ of sort $S$; sorts are directly attached to type
variables. The notation $T \to U$ is short for Ty "fun" $[T,U]$, where "fun" is the
name of the function type constructor.

Isabelle’s terms are simply typed lambda terms in De Bruijn notation: 


datatype term = Ct name typ | Fv var typ | Bv nat | Abs typ term | (·) term term


A term (typically r, s, t, u...) can be a typed constant Ct c T or free variable
Fv v T, a bound variable Bv n (a De Bruijn index), a typed abstraction Abs T t
or an application t · u.

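The term-has-type proposition has the syntax $Ts \vdash_\tau t : T$ where $Ts$ is a list
of types, the context for the type of the bound variables.


$$\frac{i < |Ts|}{Ts \vdash_\tau \text{Bv}\ i : Ts \mathbin{!} i} \qquad \frac{}{Ts \vdash_\tau \text{Ct}\ c\ T : T} \qquad \frac{}{Ts \vdash_\tau \text{Fv}\ v\ T : T}$$

$$\frac{T \mathbin{\#} Ts \vdash_\tau t : T'}{Ts \vdash_\tau \text{Abs}\ T\ t : T \to T'} \qquad \frac{Ts \vdash_\tau u : U \qquad Ts \vdash_\tau t : U \to T}{Ts \vdash_\tau t \cdot u : T}$$


We define $\vdash_\tau t : T = [] \vdash_\tau t : T$.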
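
As an illustration of these rules, the following sketch mirrors the two datatypes as Python classes and computes the type of a term by recursion over its structure. It is a reading aid under stated assumptions, not the formalization itself: the class layout, the helper `fun_ty` and the function `type_of` (which returns `None` when no rule applies) are names invented here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Ty:                       # Ty name (typ list)
    name: str
    args: Tuple["Typ", ...] = ()


@dataclass(frozen=True)
class Tv:                       # Tv var sort
    var: str
    sort: FrozenSet[str] = frozenset()


Typ = Union[Ty, Tv]


@dataclass(frozen=True)
class Ct:                       # typed constant
    name: str
    typ: Typ


@dataclass(frozen=True)
class Fv:                       # typed free variable
    var: str
    typ: Typ


@dataclass(frozen=True)
class Bv:                       # bound variable as a De Bruijn index
    idx: int


@dataclass(frozen=True)
class Abs:                      # typed abstraction
    typ: Typ
    body: "Term"


@dataclass(frozen=True)
class App:                      # application t · u
    fun: "Term"
    arg: "Term"


Term = Union[Ct, Fv, Bv, Abs, App]


def fun_ty(T: Typ, U: Typ) -> Typ:
    """The function type T -> U, encoded as Ty "fun" [T, U]."""
    return Ty("fun", (T, U))


def type_of(Ts: List[Typ], t: Term) -> Optional[Typ]:
    """Type of t in the bound-variable context Ts, or None if untypable."""
    if isinstance(t, (Ct, Fv)):
        return t.typ
    if isinstance(t, Bv):
        return Ts[t.idx] if 0 <= t.idx < len(Ts) else None
    if isinstance(t, Abs):
        body = type_of([t.typ] + Ts, t.body)
        return None if body is None else fun_ty(t.typ, body)
    f, a = type_of(Ts, t.fun), type_of(Ts, t.arg)
    if isinstance(f, Ty) and f.name == "fun" and len(f.args) == 2 and f.args[0] == a:
        return f.args[1]
    return None
```

In this sketch a term with a loose bound variable, such as `Bv(0)` in the empty context, has no type, which matches the side condition of the rule for bound variables.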

Function fv :: term $\Rightarrow$ (var $\times$ typ) set collects the free variables in a term.
Because bound variables are indices, fv t is simply the set of all (v, T) such that 
Fv v T occurs in t. The type is an integral part of a variable. 

A type substitution is a function $\varrho$ of type var $\Rightarrow$ sort $\Rightarrow$ typ. It assigns a type
to each type variable and sort pair. We write $\varrho \mathbin{\$\$} T$ or $\varrho \mathbin{\$\$} t$ for the overloaded
function which applies such a type substitution to all type variables (and their
sort) occurring in a type or term. The type instance relation is defined like this:


$$T_1 \lesssim T_2 = (\exists \varrho.\ \varrho \mathbin{\$\$} T_2 = T_1)$$


We also need to $\beta$-contract a term Abs T t · u to something like “t with
Bv 0 replaced by u”. We define a function subst-bv such that subst-bv u t is
that $\beta$-contractum. The definition of subst-bv is shown in the Appendix and can
also be found in the literature (e.g. [23]).

In order to abstract over a free (term) variable there is a function bind-fv (v, 
T) t that (roughly speaking) replaces all occurrences of Fv v T in t by Bv 0. 
Again, see the Appendix for the definition. This produces (if Fv v T occurs in 
t) a term with an unbound Bv 0. Function Abs-fv binds it with an abstraction: 


Abs-fv v T t = Abs T (bind-fv (v, T) t) 
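
Since the Appendix is not reproduced here, the following sketch shows one standard textbook way to realize these three functions on De Bruijn terms. It is an assumption about their shape, not the definitions from the formalization: types are simplified to plain strings, and the helper `lift` as well as the level parameter `lev` are introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Ct:
    name: str
    typ: str          # types are plain strings in this sketch


@dataclass(frozen=True)
class Fv:
    var: str
    typ: str


@dataclass(frozen=True)
class Bv:
    idx: int


@dataclass(frozen=True)
class Abs:
    typ: str
    body: "Term"


@dataclass(frozen=True)
class App:
    fun: "Term"
    arg: "Term"


Term = Union[Ct, Fv, Bv, Abs, App]


def bind_fv(vT: Tuple[str, str], t: Term, lev: int = 0) -> Term:
    """Replace every Fv v T by the bound index pointing lev binders outwards."""
    if isinstance(t, Fv):
        return Bv(lev) if (t.var, t.typ) == vT else t
    if isinstance(t, Abs):
        return Abs(t.typ, bind_fv(vT, t.body, lev + 1))
    if isinstance(t, App):
        return App(bind_fv(vT, t.fun, lev), bind_fv(vT, t.arg, lev))
    return t


def abs_fv(v: str, T: str, t: Term) -> Term:
    """Abs-fv v T t = Abs T (bind-fv (v, T) t)."""
    return Abs(T, bind_fv((v, T), t))


def lift(t: Term, cutoff: int = 0) -> Term:
    """Increment every loose bound index >= cutoff by one."""
    if isinstance(t, Bv):
        return Bv(t.idx + 1) if t.idx >= cutoff else t
    if isinstance(t, Abs):
        return Abs(t.typ, lift(t.body, cutoff + 1))
    if isinstance(t, App):
        return App(lift(t.fun, cutoff), lift(t.arg, cutoff))
    return t


def subst_bv(u: Term, t: Term, lev: int = 0) -> Term:
    """Replace the index bound at level lev in t by u (one beta-contraction step)."""
    if isinstance(t, Bv):
        if t.idx < lev:
            return t
        return u if t.idx == lev else Bv(t.idx - 1)
    if isinstance(t, Abs):
        return Abs(t.typ, subst_bv(lift(u), t.body, lev + 1))
    if isinstance(t, App):
        return App(subst_bv(u, t.fun, lev), subst_bv(u, t.arg, lev))
    return t
```

Because the type is part of a variable, `bind_fv(("x", "T"), Fv("x", "U"))` leaves the term unchanged in this sketch.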


While this section described the syntax of types and terms, they are not 
necessarily wellformed and should be considered pretypes/preterms. The well- 
formedness checks are described later. 


5 Classes and Sorts 


Isabelle has a built-in system of type classes [22] as in Haskell 98 except that
class constraints are directly attached to variable names: our Tv a [C,D,...]
corresponds to Haskell’s (C a, D a, ...) => ... a ...

A sort is Isabelle’s terminology for a set of (class) names, e.g. {C,D,...},
which represents a conjunction of class constraints. In our work, variables $S$, $S'$
etc. stand for sorts.

Apart from the usual application in object logics, type classes also serve an 
important metalogical purpose: they allow us to restrict, for example, quantifi- 
cation in object logics to object-level types and rule out meta-level propositions. 

Isabelle’s type class system was first presented in a programming language 
context [29,27]. We give the first machine-checked formalization. The central 
data structure is a so-called order-sorted signature. Intuitively, it is comprised 
of a set of class names, a partial subclass ordering on them and a set of type 
constructor signatures. A type constructor signature $\kappa :: (S_1, \ldots, S_k)\ c$ for a
type constructor $\kappa$ states that applying $\kappa$ to types $T_1, \ldots, T_k$ such that $T_i$ has
sort $S_i$ (defined below) produces a type of class $c$. Formally:


type_synonym osig = ((name × name) set × (name ⇀ (class ⇀ sort list)))


To explain this formalization we start from a pair (sub, tcs) :: osig and recover
the informal order-sorted signature described above. The set of classes is simply
the Field of the sub relation. The tcs component represents the set of all type
constructor signatures $\kappa :: (Ss)\ c$ (where $Ss$ is a list of sorts) such that tcs $\kappa$ =
Some dm and dm c = Some Ss. Representing $\kappa :: (Ss)\ c$ as a triple, we define


$$TCS = \{(\kappa, Ss, c) \mid \exists domf.\ tcs\ \kappa = \text{Some}\ domf \wedge domf\ c = \text{Some}\ Ss\}$$


TCS translates tcs, the data structure close to the implementation, into an
equivalent but more intuitive form that is close to the informal
presentations in the literature.
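
A minimal sketch of this translation, assuming that maps are modelled as Python dictionaries (a missing key plays the role of None) and sorts as frozensets of class names; the function name `to_TCS` and the example classes are hypothetical:

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Sort = FrozenSet[str]
# tcs: type constructor name -> (class name -> list of argument sorts)
Tcs = Dict[str, Dict[str, List[Sort]]]


def to_TCS(tcs: Tcs) -> Set[Tuple[str, Tuple[Sort, ...], str]]:
    """Flatten the nested map into the set of triples (kappa, Ss, c)."""
    return {(k, tuple(Ss), c) for k, dm in tcs.items() for c, Ss in dm.items()}


if __name__ == "__main__":
    example = {
        "nat": {"ord": [], "order": []},
        "list": {"ord": [frozenset({"ord"})]},
    }
    for triple in sorted(to_TCS(example), key=str):
        print(triple)
```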

The subclass ordering sub can be extended to a subsort ordering as follows:


$$S_1 \le_{sub} S_2 = (\forall c_2 \in S_2.\ \exists c_1 \in S_1.\ c_1 \le_{sub} c_2)$$


The smaller sort needs to subsume all the classes in the larger sort. In particular
$\{c_1\} \le_{sub} \{c_2\}$ iff $(c_1, c_2) \in sub$.

Now we can define a predicate has-sort that checks whether, in the context
of some order-sorted signature (sub, tcs), a type fulfills a given sort constraint:

$$\frac{S \le_{sub} S'}{\text{has-sort}\ (sub, tcs)\ (\text{Tv}\ a\ S)\ S'}$$

$$\frac{tcs\ \kappa = \text{Some}\ dm \qquad \forall c \in S.\ \exists Ss.\ dm\ c = \text{Some}\ Ss \wedge \text{list-all2}\ (\text{has-sort}\ (sub, tcs))\ Ts\ Ss}{\text{has-sort}\ (sub, tcs)\ (\text{Ty}\ \kappa\ Ts)\ S}$$


The rule for type variables uses the subsort relation and is obvious. A type
$(T_1, \ldots, T_n)\ \kappa$ has sort $\{c_1, \ldots\}$ if for every $c_i$ there is a signature
$\kappa :: (S_1, \ldots, S_n)\ c_i$ and has-sort (sub, tcs) $T_j$ $S_j$ for $j = 1, \ldots, n$.
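
The two rules can be read as a recursive check. The sketch below is one possible executable reading, under the assumptions that sub is a set of pairs, tcs is a nested dictionary and sorts are frozensets; the class and function names are chosen here for illustration and the example classes "ord" and "order" are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple, Union

Sort = FrozenSet[str]


@dataclass(frozen=True)
class Ty:
    name: str
    args: Tuple["Typ", ...] = ()


@dataclass(frozen=True)
class Tv:
    var: str
    sort: Sort = frozenset()


Typ = Union[Ty, Tv]
Sub = Set[Tuple[str, str]]
Tcs = Dict[str, Dict[str, List[Sort]]]


def sort_leq(sub: Sub, S1: Sort, S2: Sort) -> bool:
    """Subsort ordering: every class in S2 has a subclass in S1."""
    return all(any((c1, c2) in sub for c1 in S1) for c2 in S2)


def has_sort(sub: Sub, tcs: Tcs, T: Typ, S: Sort) -> bool:
    """Does type T fulfil sort S in the order-sorted signature (sub, tcs)?"""
    if isinstance(T, Tv):
        return sort_leq(sub, T.sort, S)
    dm = tcs.get(T.name)
    if dm is None:
        return False
    for c in S:
        Ss = dm.get(c)
        # list-all2 demands equal lengths and pointwise has-sort
        if Ss is None or len(Ss) != len(T.args):
            return False
        if not all(has_sort(sub, tcs, Ti, Si) for Ti, Si in zip(T.args, Ss)):
            return False
    return True
```

With `order` a subclass of `ord`, a type variable of sort `{order}` has sort `{ord}` in this sketch, but not the other way round.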
We normalize a sort by removing “superfluous” class constraints, i.e. retain- 
ing only those classes that are not subsumed by other classes. This gives us 
unique representatives for sorts which we call normalized: 


$$\text{normalize-sort}\ sub\ S = \{c \in S \mid \neg(\exists c' \in S.\ (c', c) \in sub \wedge (c, c') \notin sub)\}$$
$$\text{normalized-sort}\ sub\ S = (\text{normalize-sort}\ sub\ S = S)$$


We work with normalized sorts because it simplifies the derivation of efficient 
executable code later on. 
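
As a small illustration, normalization can be written as a set comprehension. The sketch assumes sub is given as a set of pairs and uses two hypothetical classes, with "order" a subclass of "ord":

```python
from typing import FrozenSet, Set, Tuple

Sort = FrozenSet[str]
Sub = Set[Tuple[str, str]]


def normalize_sort(sub: Sub, S: Sort) -> Sort:
    """Keep only classes of S that have no strict subclass inside S."""
    return frozenset(
        c for c in S
        if not any((c2, c) in sub and (c, c2) not in sub for c2 in S)
    )


def normalized_sort(sub: Sub, S: Sort) -> bool:
    """A sort is normalized if normalization leaves it unchanged."""
    return normalize_sort(sub, S) == S


if __name__ == "__main__":
    sub = {("ord", "ord"), ("order", "order"), ("order", "ord")}
    print(sorted(normalize_sort(sub, frozenset({"ord", "order"}))))
```

Here the constraint "ord" is superfluous next to "order", so only "order" is retained.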
Now we can define wellformedness of an osig: 


$$\text{wf-osig}\ (sub, tcs) = (\text{wf-subclass}\ sub \wedge \text{wf-tcsigs}\ sub\ tcs)$$


A subclass relation is wellformed if it is a partial order where reflexivity is
restricted to its Field. Wellformedness of type constructor signatures (wf-tcsigs) is
more complex. We describe it in terms of TCS derived from tcs (see above). The
conditions are the following:
– The following property requires a) that for any $\kappa :: (\ldots)\ c_1$, there must be a
$\kappa :: (\ldots)\ c_2$ for every superclass $c_2$ of $c_1$ and b) coregularity which guarantees
the existence of principal types [29,10].

$$\forall (\kappa, Ss_1, c_1) \in TCS.\ \forall c_2.\ (c_1, c_2) \in sub \longrightarrow (\exists Ss_2.\ (\kappa, Ss_2, c_2) \in TCS \wedge \text{list-all2}\ (\lambda S_1\ S_2.\ S_1 \le_{sub} S_2)\ Ss_1\ Ss_2)$$

– A type constructor must always take the same number of argument types:

$$\forall \kappa\ Ss_1\ c_1\ Ss_2\ c_2.\ (\kappa, Ss_1, c_1) \in TCS \wedge (\kappa, Ss_2, c_2) \in TCS \longrightarrow |Ss_1| = |Ss_2|$$

– Sorts must be normalized and must exist in sub:

$$\forall (\kappa, Ss, c) \in TCS.\ \forall S \in \text{set}\ Ss.\ \text{wf-sort}\ sub\ S$$

where $\text{wf-sort}\ sub\ S = (\text{normalized-sort}\ sub\ S \wedge S \subseteq \text{Field}\ sub)$


These conditions are used in a number of places to show that the type system 
is well behaved. For example, has-sort is upward closed: 


$$\text{wf-osig}\ (sub, tcs) \wedge \text{has-sort}\ (sub, tcs)\ T\ S \wedge S \le_{sub} S' \longrightarrow \text{has-sort}\ (sub, tcs)\ T\ S'$$


6 Signatures 


A signature consists of a map from constant names to their (most general) types, a
map from type constructor names to their arities, and an order-sorted signature:


type_synonym signature = (name ⇀ typ) × (name ⇀ nat) × osig


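The three projection functions are called const-type, type-arity and osig. We now
define a number of wellformedness checks w.r.t. a signature $\Sigma$. We start with
wellformedness of types, which is pretty obvious:

$$\frac{\text{type-arity}\ \Sigma\ \kappa = \text{Some}\ |Ts| \qquad \forall T \in \text{set}\ Ts.\ \text{wf-type}\ \Sigma\ T}{\text{wf-type}\ \Sigma\ (\text{Ty}\ \kappa\ Ts)} \qquad \frac{\text{wf-sort}\ (\text{subclass}\ (\text{osig}\ \Sigma))\ S}{\text{wf-type}\ \Sigma\ (\text{Tv}\ a\ S)}$$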
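
The wellformedness check for types can be sketched as a recursive function. The sketch passes only the two signature components it needs, namely the arity map (as a dictionary) and the subclass relation (as a set of pairs); this flattening of the signature, the helper names and the example classes are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set, Tuple, Union

Sort = FrozenSet[str]
Sub = Set[Tuple[str, str]]


@dataclass(frozen=True)
class Ty:
    name: str
    args: Tuple["Typ", ...] = ()


@dataclass(frozen=True)
class Tv:
    var: str
    sort: Sort = frozenset()


Typ = Union[Ty, Tv]


def field(sub: Sub) -> Set[str]:
    """All classes that occur on either side of the subclass relation."""
    return {c for pair in sub for c in pair}


def normalize_sort(sub: Sub, S: Sort) -> Sort:
    return frozenset(
        c for c in S
        if not any((c2, c) in sub and (c, c2) not in sub for c2 in S)
    )


def wf_sort(sub: Sub, S: Sort) -> bool:
    """Normalized and built only from classes known to sub."""
    return normalize_sort(sub, S) == S and S <= field(sub)


def wf_type(type_arity: Dict[str, int], sub: Sub, T: Typ) -> bool:
    """Constructors are applied at their arity; sorts on variables are wellformed."""
    if isinstance(T, Tv):
        return wf_sort(sub, T.sort)
    return (type_arity.get(T.name) == len(T.args)
            and all(wf_type(type_arity, sub, A) for A in T.args))
```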


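Wellformedness of a term essentially just says that all types in the term are
wellformed and that the type $T'$ of a constant in the term must be an instance
of the type $T$ of that constant in the signature: $T' \lesssim T$.

$$\frac{\text{wf-type}\ \Sigma\ T}{\text{wf-term}\ \Sigma\ (\text{Fv}\ v\ T)} \qquad \frac{\text{const-type}\ \Sigma\ s = \text{Some}\ T \qquad \text{wf-type}\ \Sigma\ T' \qquad T' \lesssim T}{\text{wf-term}\ \Sigma\ (\text{Ct}\ s\ T')}$$

$$\frac{\text{wf-term}\ \Sigma\ t \qquad \text{wf-term}\ \Sigma\ u}{\text{wf-term}\ \Sigma\ (t \cdot u)} \qquad \frac{\text{wf-type}\ \Sigma\ T \qquad \text{wf-term}\ \Sigma\ t}{\text{wf-term}\ \Sigma\ (\text{Abs}\ T\ t)} \qquad \frac{}{\text{wf-term}\ \Sigma\ (\text{Bv}\ n)}$$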
These rules only check whether a term conforms to a signature, not that the
contained types are consistent. Combining wellformedness and $\vdash_\tau$ yields
well-typedness of a term:


$$\text{wt-term}\ \Sigma\ t = (\text{wf-term}\ \Sigma\ t \wedge (\exists T.\ \vdash_\tau t : T))$$


Wellformedness of a signature $\Sigma$ = (ctf, arf, oss) where oss = (sub, tcs) is
defined as follows:


$$\text{wf-sig}\ \Sigma = ((\forall T \in \text{ran}\ ctf.\ \text{wf-type}\ \Sigma\ T) \wedge \text{wf-osig}\ oss \wedge \text{dom}\ tcs = \text{dom}\ arf \wedge (\forall \kappa\ dm.\ tcs\ \kappa = \text{Some}\ dm \longrightarrow (\forall Ss \in \text{ran}\ dm.\ arf\ \kappa = \text{Some}\ |Ss|)))$$


In words: all types in ctf are wellformed, oss is wellformed, the type constructors
in tcs are exactly those that have an arity in arf, and for every type constructor
signature $(\kappa, Ss, \_)$ in tcs, $\kappa$ has arity $|Ss|$.


7 Logic 


Isabelle’s metalogic M is an extension of the logic described by Paulson [30]. It 
is a fragment of intuitionistic higher-order logic. The basic types and connectives 
of M are the following: 


Concept              | Representation                          | Abbreviation
Type of propositions | Ty "prop" []                            | prop
Implication          | Ct "imp" (prop $\to$ prop $\to$ prop)   | $\Longrightarrow$
Universal quantifier | Ct "all" ((T $\to$ prop) $\to$ prop)    | $\bigwedge_T$
Equality             | Ct "eq" (T $\to$ T $\to$ prop)          | $\equiv_T$


The type subscripts of $\bigwedge$ and $\equiv$ are dropped in the text if they can be inferred.

Readers familiar with Isabelle syntax must keep in mind that for readability
we use the symbols $\bigwedge$, $\Longrightarrow$ and $\equiv$ for the encodings of the respective symbols
in Isabelle’s metalogic. We avoid the corresponding metalogical constants
completely in favour of HOL’s $\forall$, $\longrightarrow$, $=$ and inference rule notation.

The provability judgment of M is of the form $\Theta, \Gamma \vdash t$ where $\Theta$ is a theory,
$\Gamma$ (the hypotheses) is a set of terms of type prop and $t$ a term of type prop.

A theory is a pair of a signature and a set of axioms: 


type_synonym theory = signature × term set


The projection functions are sig and axioms. We extend the notion of wellformed- 
ness from signatures to theories: 


$$\text{wf-theory}\ (\Sigma, axs) = (\text{wf-sig}\ \Sigma \wedge (\forall p \in axs.\ \text{wt-term}\ \Sigma\ p \wedge \vdash_\tau p : \text{prop}) \wedge \text{is-std-sig}\ \Sigma \wedge \text{eq-axs} \subseteq axs)$$


The first two conjuncts need no explanation. Predicate is-std-sig (not shown)
requires the signature to have certain minimal content: the basic types ($\to$, prop)
and constants ($\equiv$, $\bigwedge$, $\Longrightarrow$) of M and the additional types and constants for type
class reasoning from Section 7.3. Our theories also need to contain a minimal set
of axioms. The set eq-axs is an axiomatic basis for equality reasoning and will
be explained in Section 7.2.

We will now discuss the inference system in three steps: the basic inference 
rules, equality and type class reasoning. 


7.1 Basic Inference Rules 


The axiom rule states that wellformed type-instances of axioms are provable: 


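$$\frac{\text{wf-theory}\ \Theta \qquad t \in \text{axioms}\ \Theta \qquad \text{wf-inst}\ \Theta\ \varrho}{\Theta, \Gamma \vdash \varrho \mathbin{\$\$} t}$$


where $\varrho$ :: var $\Rightarrow$ sort $\Rightarrow$ typ is a type substitution and \$\$ denotes its
application (see Section 4). The types substituted into the type variables need to be
wellformed and conform to the sort constraint of the type variable:


$$\text{wf-inst}\ (\Sigma, axs)\ \varrho = (\forall v\ S.\ \varrho\ v\ S \neq \text{Tv}\ v\ S \longrightarrow \text{has-sort}\ (\text{osig}\ \Sigma)\ (\varrho\ v\ S)\ S \wedge \text{wf-type}\ \Sigma\ (\varrho\ v\ S))$$


The conjunction only needs to hold if $\varrho$ actually changes something, i.e. if
$\varrho\ v\ S \neq \text{Tv}\ v\ S$. This condition is not superfluous because otherwise has-sort oss
(Tv v S) S and wf-type $\Sigma$ (Tv v S) only hold if S is wellformed w.r.t. $\Sigma$.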

Note that there are no extra rules for general instantiation of type or term 
variables. Type variables can only be instantiated in the axioms. Term instanti- 
ation can be performed using the forall introduction and elimination rules. 

The assumption rule allows us to prove terms already in the hypotheses: 


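$$\frac{\text{wf-term}\ (\text{sig}\ \Theta)\ t \qquad \vdash_\tau t : \text{prop} \qquad t \in \Gamma}{\Theta, \Gamma \vdash t}$$


Both $\bigwedge$ and $\Longrightarrow$ are characterized by introduction and elimination rules:


$$\frac{\text{wf-theory}\ \Theta \qquad \Theta, \Gamma \vdash t \qquad (x, T) \notin FV\ \Gamma \qquad \text{wf-type}\ (\text{sig}\ \Theta)\ T}{\Theta, \Gamma \vdash \bigwedge\nolimits_T (\text{Abs-fv}\ x\ T\ t)}$$

$$\frac{\Theta, \Gamma \vdash \bigwedge\nolimits_T (\text{Abs}\ T\ t) \qquad \vdash_\tau u : T \qquad \text{wf-term}\ (\text{sig}\ \Theta)\ u}{\Theta, \Gamma \vdash \text{subst-bv}\ u\ t}$$

$$\frac{\text{wf-theory}\ \Theta \qquad \Theta, \Gamma \vdash u \qquad \text{wf-term}\ (\text{sig}\ \Theta)\ t \qquad \vdash_\tau t : \text{prop}}{\Theta, \Gamma - \{t\} \vdash t \Longrightarrow u}$$

$$\frac{\Theta, \Gamma_1 \vdash t \Longrightarrow u \qquad \Theta, \Gamma_2 \vdash t}{\Theta, \Gamma_1 \cup \Gamma_2 \vdash u}$$


where $FV\ \Gamma = (\bigcup_{t \in \Gamma} \text{fv}\ t)$.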
7.2 Equality


Most rules about equality are not part of the inference system but are axioms 
(the set eq-axs mentioned above). Consequences are obtained via the axiom rule. 
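The first three axioms express that $\equiv$ is reflexive, symmetric and transitive:


$$x \equiv x \qquad x \equiv y \Longrightarrow y \equiv x \qquad x \equiv y \Longrightarrow y \equiv z \Longrightarrow x \equiv z$$


The next two axioms express that terms of type prop (A and B) are equal iff
they are logically equivalent:


$$A \equiv B \Longrightarrow A \Longrightarrow B \qquad (A \Longrightarrow B) \Longrightarrow (B \Longrightarrow A) \Longrightarrow A \equiv B$$


The last equality axioms are congruence rules for application and abstraction:


$$f \equiv g \Longrightarrow x \equiv y \Longrightarrow (f \cdot x) \equiv (g \cdot y)$$
$$\bigwedge (\text{Abs}\ T\ ((f \cdot \text{Bv}\ 0) \equiv (g \cdot \text{Bv}\ 0))) \Longrightarrow \text{Abs}\ T\ (f \cdot \text{Bv}\ 0) \equiv \text{Abs}\ T\ (g \cdot \text{Bv}\ 0)$$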


Paulson [30] gives a slightly different congruence rule for abstraction, which
allows abstracting over an arbitrary free x in f, g. We are able to derive this rule
in our inference system.

Finally there are the lambda calculus rules. There is no need for $\alpha$ conversion
because $\alpha$-equivalent terms are already identical thanks to the De Bruijn indices
for bound variables. For $\beta$ and $\eta$ conversion the following rules are added. In
contrast to the rest of this subsection, these are not expressed as axioms.


$$\frac{\text{wf-theory}\ \Theta \qquad \text{wt-term}\ (\text{sig}\ \Theta)\ (\text{Abs}\ T\ t) \qquad \text{wf-term}\ (\text{sig}\ \Theta)\ u \qquad \vdash_\tau u : T}{\Theta, \Gamma \vdash (\text{Abs}\ T\ t \cdot u) \equiv \text{subst-bv}\ u\ t} \quad (\beta)$$

$$\frac{\text{wf-theory}\ \Theta \qquad \text{wf-term}\ (\text{sig}\ \Theta)\ t \qquad \vdash_\tau t : T \to T'}{\Theta, \Gamma \vdash \text{Abs}\ T\ (t \cdot \text{Bv}\ 0) \equiv t} \quad (\eta)$$


Rule ($\beta$) uses the substitution function subst-bv as explained in Section 4 (and
defined in the Appendix).

Rule ($\eta$) requires a few words of explanation. We do not explicitly require
that t does not contain Bv 0. This is already a consequence of the precondition
that $\vdash_\tau t : T \to T'$: it implies that t is closed. For that reason it is perfectly
unproblematic to remove the abstraction above t.


7.3 Type Class Reasoning 


Wenzel [38] encoded class constraints of the form “type T has class c” in the
term language as follows. There is a unary type constructor named “itself” and
T itself abbreviates Ty “itself” [T]. The notation TYPE T itself is short for
Ct “type” (T itself) where “type” is the name of a new uninterpreted constant.
You should view TYPE T itself as the term-level representation of type T.

Next we represent the predicate “is of class c” on the term level. For this we
define some fixed injective mapping const-of-class from class to constant names.
For each new class c a new constant const-of-class c of type T itself → prop is
added. The term Ct (const-of-class c) (T itself → prop) · TYPE T itself represents
the statement “type T has class c”. This is the inference rule deriving such
propositions:


wf-theory Θ
const-type (sig Θ) (const-of-class C) = Some ('a itself → prop)
wf-type (sig Θ) T    has-sort (osig (sig Θ)) T {C}
---------------------------------------------------------------
Θ,Γ ⊢ Ct (const-of-class C) (T itself → prop) · TYPE T itself


This is how the has-sort inference system is integrated into the logic. 


This concludes the presentation of M. We have shown some minimal sanity 
properties, incl. that all provable terms are of type prop and wellformed: 


Theorem 1. Θ,Γ ⊢ t ⟶ ⊢τ t : prop ∧ wf-term (sig Θ) t


The attentive reader will have noticed that we do not require unused
hypotheses in Γ to be wellformed and of type prop. Similarly, we only require
wf-theory Θ in rules that need it to preserve wellformedness of the terms and
types involved. To restrict to wellformed theories and hypotheses we define a
top-level provability judgment that requires wellformedness:


Θ,Γ ⊩ t = (wf-theory Θ ∧ (∀h∈Γ. wf-term (sig Θ) h ∧ ⊢τ h : prop) ∧ Θ,Γ ⊢ t)


8 Proof Terms and Checker 


Berghofer and Nipkow [4] added proof terms to Isabelle. We present an ex- 
ecutable checker for these proof terms that is proved sound w.r.t. the above 
formalization of the metalogic. Berghofer and Nipkow also developed a proof 
checker but it was unverified and checked the generated proof terms by feeding 
them back through Isabelle’s unverified inference kernel. 

It is crucial to realize that all we need to know about the proof term checker 
is the soundness theorem below. The internals are, from a soundness perspective, 
irrelevant, which is why we can get away with sketching them informally. This 
is in contrast to the logic itself, which acts like a specification, which is why we 
presented it in detail. 

This is our data type of proof terms: 


datatype proofterm = PAxm term (((var × sort) × typ) list) | PBound nat
  | Abst typ proofterm | AbsP term proofterm | Appt proofterm term
  | AppP proofterm proofterm | OfClass typ name | Hyp term


These proof terms are not designed to record proofs in our inference system, but
to mirror the proof terms generated by Isabelle. Nevertheless, the constructors
of our proof terms correspond roughly to the rules of the inference system. PAxm
contains an axiom and a type substitution. This substitution is encoded as an
association list instead of a function. AbsP and Abst correspond to introduction
of ⟹ and ⋀, AppP and Appt correspond to the respective eliminations. Hyp
and PBound relate to the assumption rule, where Hyp refers to a free assumption
while PBound contains a De Bruijn index referring to an assumption added
during the proof by an AbsP constructor. OfClass denotes a proof that a type
belongs to a given type class.

Isabelle looks at terms modulo αβη-equivalence and therefore does not save
β or η steps, while they are explicit steps in our inference system. Therefore
we have no constructors corresponding to the (β) and (η) rules. The remaining
equality axioms are naturally handled by the PAxm constructor.

In the rest of the section we discuss how to derive an executable proof checker. 
Executability means that the checker is defined as a set of recursive functions that 
Isabelle’s code generator can translate into one of a number of target languages, 
in particular its implementation language SML [5,9,8]. 

Because of the approximate correspondence between proof term constructors 
and inference rules, implementing the proof checker largely amounts to providing 
executable versions of each inference rule, as in LCF: each rule becomes a func- 
tion that checks the side conditions, and if they are true, computes the conclusion 
from the premises given as arguments. The overall checker is a function 


replay :: theory ⇒ proofterm ⇒ term option


In particular we need to make the inductive wellformedness checks for sorts, types 
and terms, signatures and theories executable. Mostly, this amounts to providing 
recursive versions of inductive definitions and proving them equivalent. 
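The LCF-style reading of replay can be sketched for a small fragment. The following is an illustration only and makes strong simplifying assumptions: propositions are atoms or implications, there are no types, theories, axioms or βη normalization, and all names are hypothetical. Each constructor is handled by a function case that checks its side condition and either computes the conclusion or returns None, mirroring the `term option` result.

```python
from dataclasses import dataclass

# Hypothetical propositions: an atom is a string, ("imp", A, B) is A ==> B.

@dataclass(frozen=True)
class Hyp:        # free assumption
    prop: object

@dataclass(frozen=True)
class PBound:     # De Bruijn index of an assumption introduced by AbsP
    index: int

@dataclass(frozen=True)
class AbsP:       # implication introduction
    prop: object
    body: object

@dataclass(frozen=True)
class AppP:       # implication elimination
    fun: object
    arg: object


def replay(p, bound=()):
    """Return the proposition proved by p, or None if a check fails."""
    if isinstance(p, Hyp):
        return p.prop
    if isinstance(p, PBound):
        return bound[p.index] if p.index < len(bound) else None
    if isinstance(p, AbsP):
        concl = replay(p.body, (p.prop,) + bound)
        return None if concl is None else ("imp", p.prop, concl)
    if isinstance(p, AppP):
        f, a = replay(p.fun, bound), replay(p.arg, bound)
        if f is None or a is None:
            return None
        # Side condition: f must be an implication whose premise is a.
        if isinstance(f, tuple) and len(f) == 3 and f[0] == "imp" and f[1] == a:
            return f[2]
        return None
    return None
```

For instance, `replay(AbsP("A", PBound(0)))` yields the proposition A ==> A, while applying a proof of A ==> B to a proof of C is rejected.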

We now discuss some of the more difficult implementation steps. To model
Isabelle’s view of terms modulo αβη-equivalence, we βη normalize our terms
(α-equivalence is for free thanks to De Bruijn notation) during the reconstruction
of the proof. A lengthy proof shows that this preserves provability (we do not
go into the details):


wf-theory Θ ∧ finite Γ ∧ (∀A∈Γ. wt-term (sig Θ) A ∧ ⊢τ A : prop) ∧ Θ,Γ ⊢ t ∧
beta-eta-norm t = Some u ⟶ Θ,Γ ⊢ u


Isabelle’s code generator needs some help handling the maps used in the
(order-sorted) signatures. We provide a refinement of maps to association lists.
Another problematic point is the definition of the type instance relation (≲),
which contains an (unbounded) existential quantifier. To make this executable, we
provide an implementation which tries to compute a suitable type substitution. In
another step, we refine the type substitution to an association list as well.
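The idea of replacing the existential quantifier by a computation can be sketched as one-sided matching of types. This is a hedged illustration, not the formalization's code: the tuple encoding, the names `match` and `type_instance`, and the omission of sorts are assumptions of this example. The function tries to build a substitution for the type variables of the pattern and returns it as an association list, or None if no substitution exists.

```python
# Hypothetical encoding: ("Tv", name) is a type variable (sorts omitted here),
# ("Ty", name, [args]) is a type constructor applied to argument types.

def match(pattern, target, subst):
    """Extend subst so that pattern instantiated by subst equals target."""
    if pattern[0] == "Tv":
        name = pattern[1]
        if name in subst:
            # A variable must be instantiated consistently.
            return subst if subst[name] == target else None
        return {**subst, name: target}
    if target[0] != "Ty" or pattern[1] != target[1]:
        return None
    if len(pattern[2]) != len(target[2]):
        return None
    for p, t in zip(pattern[2], target[2]):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst


def type_instance(target, pattern):
    """Return an association list witnessing the instance, or None."""
    subst = match(pattern, target, {})
    return None if subst is None else sorted(subst.items())


nat = ("Ty", "nat", [])
boolean = ("Ty", "bool", [])
a = ("Tv", "a")

def lst(T):
    return ("Ty", "list", [T])

def fun(A, B):
    return ("Ty", "fun", [A, B])
```

With these definitions, nat ⇒ nat list is recognised as an instance of 'a ⇒ 'a list with the association list [('a, nat)], whereas nat ⇒ bool list is not.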


In the end we obtain a proof checker 
check-proof Θ P p = (wf-theory Θ ∧ replay Θ P = Some p)


that checks theory Θ and checks if proof P proves the given proposition p. The
latter check is important because the Isabelle theorems that we check contain
both a proof and a proposition that the theorem claims to prove. Function
check-proof checks this claim. As one of our main results, we can prove the
correctness of our checker:

Theorem 2. check-proof Θ P p ⟶ Θ, set (hyps P) ⊢ p


The proof itself is conceptually simple and proceeds by induction over the
structure of proof terms. For each proof constructor we need to show that the
corresponding inference rule leads to the same conclusion as its functional
version used by replay. Most of the proof effort goes into a large library of
results about terms, types, signatures, substitutions, wellformedness etc.
required for the proof, most importantly the fact that βη normalization
preserves provability.


9 Size and Structure of the Formalization 


All material presented so far has been formalized in Isabelle/HOL. The definition
of the inference system (incl. types, terms etc.) resides in a separate theory Core
that depends only on the basic library of Isabelle/HOL. It takes about 300 LOC
and is fairly high level and readable (we presented most of it). This is at least an
order of magnitude smaller than Isabelle’s inference kernel (which is not clearly
delineated), although of course the latter is optimized for performance. Its
abstract type of theorems alone takes about 2,500 LOC, not counting any
infrastructure of terms, types, unification etc.

The whole formalization consists of 10,000 LOC. The main components are: 


— Almost half the formalization (4,700 LOC) is devoted to providing a library
of operations on types and terms and their properties. This includes, among
others, executable functions for type checking, different types of
substitutions, abstractions, the wellformedness checks and β and η reductions.

— Proving derived rules of our inference system takes up 3,000 LOC. A large
part of this is deriving rules for equality and the β and η reductions.
Weakening rules are also derived.

— Making the wellformedness checks for (order-sorted) signatures and theories
as well as the type instance checks executable takes 1,800 LOC.

— Definition and correctness proof for the checker build on the above material
and take only about 500 additional LOC.


10 Integration with Isabelle 


As explained above, Isabelle generates SML code for the proof checker. This 
code has its own definitions of types, terms etc. and needs to be interfaced with 
the corresponding data structures in Isabelle. This step requires 150 lines of 
handwritten SML code (glue code) that translates Isabelle’s data structures into 
the corresponding data structures in the generated proof checker such that we 
can feed them into check-proof. We cannot verify this code and therefore aim 
to keep it as small and simple as possible. This is the reason for the previously 
mentioned intentional implementation bias we introduced in our formalization. 
We describe now how the various data types are translated. We call a translation 
trivial if it merely replaces one constructor by another, possibly forgetting some 
information. 
The translation of types and terms is trivial as their structure is almost
identical in the two settings. For Isabelle code experts it should be mentioned
that the two term constructors Free and Var in Isabelle (which both represent
free variables but Var can be instantiated by unification) are combined in type
var of the formalization which we left unspecified but which in fact looks like
this: datatype var = Free name | Var indexname. This is purely to trivialize
the glue code; in our formalization var is totally opaque.

Proof term translation is trivial except for two special cases. Previously 
proved lemmas become axioms in the translation (see also below) and so-called 
“oracles” (typically the result of unfinished proofs, i.e. “sorry” on the user level) 
are rejected (but none of the theories we checked contain oracles). Also remem- 
ber that the translation of proofs is not safety critical because all that matters 
is that in the end we obtain a correct proof of the claimed proposition. 

We also provide functions to translate relevant content from the background 
theory: axioms and (order-sorted) signatures. This mostly amounts to extracting 
association lists from efficient internal data structures. Translating the axioms 
also involves translating some alternative internal representation of type class 
constraints into their standard form presented in Sect. 7.3. 

The checker is integrated into Isabelle by calling it every time a new named 
theorem has been proved. The set of theorems proved so far is added to the ax- 
iomatic basis for this check. Cyclic dependencies between lemmas are ruled out 
by this ordering because every theorem is checked before being added to the ax- 
iomatic basis. However, an explicit cyclicity check is not part of the formalization 
(yet), which speaks only about checking single proofs. 


11 Running the Proof Checker 


We run this modified Isabelle with our proof checker on multiple theories in
various object logics contained in the Isabelle distribution. A rough overview
of the scope of the covered material for some logics and the required running
times can be found in the following table. The running times are the total times
for running Isabelle, not just the proof checking, but the latter takes 90% of
the time. All tests were performed on an Intel Core i7-9750H CPU running at
2.60 GHz with 32 GB of RAM.


Logic      LOC    Time

FOL      4,500    45 secs
ZF      55,000    25 mins
HOL     10,000    26 mins


We can check the material in several smaller object logics in their entirety. 
One of the larger such logics is first-order logic (FOL). These logics do not de- 
velop any applications but FOL comes with proof automation and theories test- 
ing that automation, in particular Pelletier’s collection of problems that were 
considered challenges in their day [32]. Because the proofs are found automat- 
ically, the resulting proof terms will typically be quite complex and good test 
material for a proof checker. 
The logic ZF (Zermelo-Fraenkel set theory) builds on FOL but contains real
applications and is an order of magnitude larger than FOL. We are able to check
all material formalized in ZF in the Isabelle distribution.

Isabelle’s most frequently used and largest object logic is HOL. We managed 
to check about 12% of the Main library. This includes the basic logic and the 
libraries of sets, functions, orderings, lattices and groups. The formalizations are 
non-trivial and make heavy use of Isabelle’s type classes. 

Why can we check about five times as many lines of code in ZF compared to
HOL? Profiling revealed that the proof checker spends a lot of time in functions
that access the signature, especially the wellformedness checks. The primary
reason is the use of inefficient data structures (e.g. association lists): the
running time depends heavily on the size of the signature and increases with
every new constant, type and class. To make matters worse, there is no sharing
of any kind in terms/types and their wellformedness checks. Because ZF is free
of polymorphism and type classes, these wellformedness checks are much simpler.
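The comparison can be made concrete with a little arithmetic on the table above. The snippet below is only a back-of-the-envelope sketch: it uses the total running times from the table (which include about 10% non-checking time) and the variable names are chosen for this example.

```python
# LOC and total running time in seconds, taken from the table above.
rows = {
    "FOL": (4_500, 45),
    "ZF": (55_000, 25 * 60),
    "HOL": (10_000, 26 * 60),
}

# Checked lines of code per second for each logic.
rate = {logic: loc / secs for logic, (loc, secs) in rows.items()}

# ZF covers 5.5 times as many lines as the checked part of HOL
# in roughly the same time.
loc_ratio = rows["ZF"][0] / rows["HOL"][0]
```

This gives roughly 100 LOC per second for FOL, about 37 for ZF and about 6 for HOL, which matches the observation that the HOL signature makes each check considerably more expensive.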


12 Trust Assumptions 


We need to trust the following components outside of the formalization: 


— The verification (and code generation) of our proof checker in Isabelle/HOL.
This is inevitable: one has to trust some theorem prover to start with. We
could improve the trustworthiness of this step by porting our proofs to the
verified HOL prover by Kumar et al. [13] but its code generator produces
CakeML [14], not SML.

— The unverified glue code in the integration of our proof checker into Isabelle 
(Sect. 10). 


Because users currently cannot examine Isabelle’s internal data structures 
that we start from, they have to trust Isabelle’s front end that parses and trans- 
forms some textual input file into internal data structures. One could add a 
(possibly verified) presentation layer that outputs those internal representations 
into a readable format that can be inspected, while avoiding the traps Adams 
[3] is concerned with. 


13 Future Work 


Our primary focus will be on scaling up the proof checker to not just deal with all 
of HOL but with real applications (including itself!). There is a host of avenues 
for exploration. Just to name a few promising directions: more efficient data 
structures than association lists (e.g. via existing frameworks [19,20]); caching 
of wellformedness checks for types and terms; exploiting sharing within terms 
and types (tricky because our intentionally simple glue code creates copies); 
working with the compressed proof terms [5] that Isabelle creates by default 
instead of uncompressing them as we do now. 
We will also upgrade the formalization of our checker from individual
theorems to sets of theorems, explicitly checking cyclic dependencies (which are
currently prevented by the glue code, see Sect. 10).

A presentation layer as discussed in Sect. 12 would not just allow the inspec- 
tion of the internal representation of the theories but could also be extended to 
the proofs themselves, thus permitting checkers to be interfaced with Isabelle on 
a textual level instead of internal data structures. 

It would also be nice to have a model-theoretic semantics for M. We believe
that the work by Kunčar and Popescu [15,16,17,18] could be adapted from HOL
to M. This would in particular yield semantically justified cyclicity checks for
constant and type definitions which we currently treat as axioms because a purely
syntactic justification is unclear.
A Appendix

subst-bv u t = subst-bv2 t 0 u


subst-bv2 (Bv i) n u = (if i < n then Bv i else if i = n then u else Bv (i − 1))
subst-bv2 (Abs T t) n u = Abs T (subst-bv2 t (n + 1) (lift u 0))
subst-bv2 (f · t) n u = subst-bv2 f n u · subst-bv2 t n u
subst-bv2 t _ _ = t


lift (Bv i) n = (if n ≤ i then Bv (i + 1) else Bv i)
lift (Abs T t) n = Abs T (lift t (n + 1))
lift (f · t) n = lift f n · lift t n
lift t _ = t


bind-fv T t = bind-fv2 T 0 t


bind-fv2 var n (Fv v T) = (if var = (v, T) then Bv n else Fv v T)
bind-fv2 var n (Abs T t) = Abs T (bind-fv2 var (n + 1) t)
bind-fv2 var n (f · u) = bind-fv2 var n f · bind-fv2 var n u
bind-fv2 _ _ t = t
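The appendix functions can be transcribed almost literally into executable form. The following sketch assumes a tuple encoding of terms and Python-style names (`subst_bv`, `bind_fv`, ...) chosen for this example; the comparison `n <= i` in `lift` is the usual De Bruijn lifting condition and is an assumption about the intended reading of the definition.

```python
# Hypothetical encoding:
#   ("Bv", i) | ("Fv", name, T) | ("Abs", T, body) | ("App", f, u)

def lift(t, n):
    """Increment all loose bound variables with index >= n."""
    tag = t[0]
    if tag == "Bv":
        return ("Bv", t[1] + 1) if n <= t[1] else t
    if tag == "Abs":
        return ("Abs", t[1], lift(t[2], n + 1))
    if tag == "App":
        return ("App", lift(t[1], n), lift(t[2], n))
    return t


def subst_bv2(t, n, u):
    """Replace Bv n in t by u and decrement the indices above n."""
    tag = t[0]
    if tag == "Bv":
        i = t[1]
        if i < n:
            return t
        return u if i == n else ("Bv", i - 1)
    if tag == "Abs":
        return ("Abs", t[1], subst_bv2(t[2], n + 1, lift(u, 0)))
    if tag == "App":
        return ("App", subst_bv2(t[1], n, u), subst_bv2(t[2], n, u))
    return t


def subst_bv(u, t):
    return subst_bv2(t, 0, u)


def bind_fv2(var, n, t):
    """Turn the free variable var = (name, T) into the bound variable Bv n."""
    tag = t[0]
    if tag == "Fv":
        return ("Bv", n) if var == (t[1], t[2]) else t
    if tag == "Abs":
        return ("Abs", t[1], bind_fv2(var, n + 1, t[2]))
    if tag == "App":
        return ("App", bind_fv2(var, n, t[1]), bind_fv2(var, n, t[2]))
    return t


def bind_fv(var, t):
    return bind_fv2(var, 0, t)


x = ("Fv", "x", "a")
body = ("App", ("Bv", 0), ("Abs", "a", ("App", ("Bv", 1), ("Bv", 0))))
```

Substituting `x` into `body` replaces both occurrences that refer to the outermost binder, including the one under the inner abstraction, and leaves the inner bound variable untouched.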
Abstract. ksmt is a CDCL-style calculus for solving non-linear constraints
over the real numbers involving polynomials and transcendental
functions. In this paper we investigate properties of the ksmt calculus
and show that it is a δ-complete decision procedure for bounded problems.
We also propose an extension with local linearisations, which allow
for more efficient treatment of non-linear constraints.


1 Introduction 


Solving non-linear constraints is important in many applications, including
verification of cyber-physical systems, software verification, and proof
assistants for mathematics [25,21,2,1,15,6]. Hence there have been a number of
approaches for solving non-linear constraints, involving symbolic methods
[16,23,29,18] as well as numerically inspired ones, in particular for dealing
with transcendental functions [13,30], and combinations of symbolic and numeric
methods [7,11,12].

In [7] we introduced the ksmt calculus for solving non-linear constraints over 
a large class of functions including polynomial, exponential and trigonometric 
functions. The ksmt calculus combines CDCL-style reasoning [28,22,3] over the 
reals based on conflict resolution [19] with incremental linearisations of non- 
linear functions using methods from computable analysis [31,24]. Our approach is 
based on computable analysis and exact real arithmetic which avoids limitations 
of double precision computations caused by rounding errors and instabilities in 
numerical methods. In particular, satisfiable and unsatisfiable results returned 
by ksmt are exact as required in many applications. This approach also supports 
implicit representations of functions as solutions of ODEs and PDEs [26]. 

It is well known that in the presence of transcendental functions the
constraint satisfiability problem is undecidable [27]. However if we only require
solutions up to some specified precision δ, then the problem can be solved
algorithmically on bounded instances and that is the motivation behind
δ-completeness, which was introduced in [13]. In essence a δ-complete procedure
decides if a formula is unsatisfiable or a δ-weakening of the formula is
satisfiable.


* This research was partially supported by an Intel research grant, the DFG grant
WERA MU 1801/5-1 and the RFBR-JSPS 20-51-5000 grant.

4 Implementation is available at http://informatik.uni-trier.de/~brausse/ksmt/

In this paper we investigate theoretical properties of the ksmt calculus, and
its extension δ-ksmt for the δ-SMT setting. Our main results are as follows:


1. We introduce a notion of ε-full linearisations and prove that all ε-full runs
of ksmt are terminating on bounded instances.

2. We extend the ksmt calculus to the δ-satisfiability setting and prove that
δ-ksmt is a δ-complete decision procedure for bounded instances.

3. We introduce an algorithm for computing ε-full local linearisations and
integrate it into δ-ksmt. Local linearisations can be used to considerably
narrow the search space by taking into account local behaviour of non-linear
functions, avoiding computationally expensive global analysis.


In Section 3, we give an overview of the ksmt calculus and introduce
the notion of ε-full linearisation used throughout the rest of the paper. We
also present a completeness theorem. Section 4 introduces the notion of
δ-completeness and related concepts. In Section 5 we introduce the δ-ksmt
adaptation, prove it is correct and δ-complete, and give concrete effective
linearisations based on a uniform modulus of continuity. Finally in Section 6, we
introduce local linearisations and show that termination is independent of
computing uniform moduli of continuity, before we conclude in Section 7.


2 Preliminaries 


The following conventions are used throughout this paper. By ‖·‖ we denote
the maximum-norm ‖(x_1, x_2, ..., x_n)‖ = max{|x_i| : 1 ≤ i ≤ n}. When it helps
clarity, we write finite and infinite sequences x = (x_1, ..., x_n) and y = (y_i)_i in
bold typeface. We are going to use open balls B(c, ε) = {x : ‖x − c‖ < ε} ⊆ ℝ^n
for c ∈ ℝ^n and ε > 0 and Ā to denote the closure of the set A ⊆ ℝ^n in the
standard topology induced by the norm. By ℚ_{>0} we denote the set {q ∈ ℚ :
q > 0}. For sets X, Y, a (possibly partial) function from X to Y is written as
X → Y. We use the notion of compactness: a set A is compact iff every open
cover of A has a finite subcover. In Euclidean spaces this is equivalent to A being
bounded and closed [32].


Basic Notions of Computable Analysis 


Let us recall the notion of computability of functions over real numbers used
throughout this paper. A rational number q is an n-approximation of a real
number x if ‖q − x‖ < 2^(−n). Informally, a function f is computed by a
function-oracle Turing machine M_f^?, where ? is a placeholder for the oracle
representing the argument of the function, in the following way. The real
argument x is represented by an oracle function φ : ℕ → ℚ, for each n returning
an n-approximation φ_n of x. For simplicity, we refer to φ by the sequence
(φ_n)_n. When run with argument p ∈ ℕ, M_f^φ(p) computes a rational
p-approximation of f(x) by querying its oracle φ for approximations of x. Let us
note that the definition of the oracle machine does not depend on the concrete
oracle, i.e., the oracle can be seen as a parameter. In case only the machine
without a concrete oracle is of interest, we write M_f^?. We refer to [17] for a
precise definition of the model of computation by function-oracle Turing
machines which is standard in computable analysis.


Definition 1 ([17]). Consider x ∈ ℝ^n. A name for x is a rational sequence
φ = (φ_k)_k such that ∀k : ‖φ_k − x‖ < 2^(−k). A function f : ℝ^n → ℝ is computable
iff there is a function-oracle Turing machine M_f^? such that for all x ∈ dom f
and names φ for x, |M_f^φ(p) − f(x)| < 2^(−p) holds for all p ∈ ℕ.
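A toy instance of this definition may help. The sketch below is an illustration under simplifying assumptions: the oracle is an ordinary Python function from precision to an exact rational, the "machine" is a plain function taking that oracle as a parameter, and the names `name_of_one_third` and `machine_double` are hypothetical. It computes f(x) = 2x by asking the oracle for one extra bit of precision.

```python
from fractions import Fraction


def name_of_one_third(n):
    """A name for x = 1/3: a rational within 2^(-n) of x for every n."""
    return Fraction(2**n // 3, 2**n)


def machine_double(oracle, p):
    """Oracle machine for f(x) = 2x.

    If |oracle(p + 1) - x| < 2^(-(p + 1)), then the result differs from 2x
    by less than 2 * 2^(-(p + 1)) = 2^(-p).
    """
    return 2 * oracle(p + 1)
```

The machine itself never inspects x; it only queries the oracle, so the same `machine_double` works for any name of any real argument.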


This definition is closely related to interval arithmetic with unrestricted
precision, but enhanced with the guarantee of convergence, and it is equivalent
to the notion of computability used in [31]. The class of computable functions
contains polynomials and transcendental functions like sin, cos, exp, among
others. It is well known [17,31] that this class is closed under composition and
that computable functions are continuous. By continuity, a computable function
f : ℝ^n → ℝ total on a compact D ⊆ ℝ^n has a computable uniform modulus of
continuity μ_f : ℕ → ℕ on D [31, Theorem 6.2.7], that is,


∀k ∈ ℕ ∀y, z ∈ D : ‖y − z‖ < 2^(−μ_f(k)) ⟹ |f(y) − f(z)| < 2^(−k).    (2.1)


A uniform modulus of continuity of f expresses how changes in the value of f 
depend on changes of the arguments in a uniform way. 
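As a worked instance of (2.1), consider f(x) = x² on D = [−4, 4]; this function and domain are chosen for illustration and do not come from the paper. Since |y² − z²| = |y + z| · |y − z| ≤ 8 · |y − z| on D, the function mu(k) = k + 3 is a uniform modulus of continuity. The sketch below spot-checks the implication on a grid of exact rationals.

```python
from fractions import Fraction


def f(x):
    return x * x


def mu(k):
    """Uniform modulus of continuity of f on [-4, 4]: 8 = 2^3, so 3 extra bits."""
    return k + 3


def check_modulus(k):
    """Spot-check (2.1) for pairs y, z in [-4, 4] at distance 2^(-(mu(k) + 1))."""
    step = Fraction(1, 2 ** (mu(k) + 1))
    for i in range(-16, 16):
        y = Fraction(i, 4)
        z = y + step
        if abs(y - z) < Fraction(1, 2 ** mu(k)):
            if not abs(f(y) - f(z)) < Fraction(1, 2**k):
                return False
    return True
```

The check is of course no proof; it merely exercises the bound derived above with exact arithmetic.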


3 The ksmt Calculus 


We first describe the ksmt calculus for solving non-linear constraints [7]
informally, and subsequently recall the main definitions which we use in this
paper. The ksmt calculus consists of transition rules, which, for any formula in
separated linear form, allow deriving lemmas consistent with the formula and, in
case of termination, produce a satisfying assignment for the formula or show that
it is unsatisfiable. A quantifier-free formula is in separated linear form ℒ ∪ 𝒩 if
ℒ is a set of clauses over linear constraints and 𝒩 is a set of non-linear atomic
constraints; this notion is rigorously defined below.

In the ksmt calculus there are four transition rules applied to its states:
Assignment refinement (A), Conflict resolution (R), Backjumping (B) and
Linearisation (L). The final ksmt states are sat and unsat. A non-final ksmt state
is a triple (α, ℒ, 𝒩) where α is a (partial) assignment of variables to rationals. A
ksmt derivation starts with an initial state where α is empty and tries to extend
this assignment to a solution of ℒ ∪ 𝒩 by repeatedly applying the Assignment
refinement rule. When such assignment extension is not possible we either
obtain a linear conflict which is resolved using the conflict resolution rule, or a
non-linear conflict which is resolved using the linearisation rule.

The main idea behind the linearisation rule is to approximate the non-linear
constraints around the conflict using linear constraints in such a way that the
conflict will be shifted into the linear part where it will be resolved using conflict
resolution. Application of either of these two rules results in a state containing a
clause evaluating to false under the current assignment. This is followed by either
application of the backjumping rule, which undoes assignments or by termination
in case the formula is unsat. In this procedure, only the assignment and linear
part of the state change and the non-linear part stays fixed.


[Figure: state diagram starting from a formula in separated linear form, with
nodes labelled p.lin.incons., p.lin.cons. and p.nlin.incons.]

Fig. 1. Core of ksmt calculus. Derivations terminate in red nodes.


Notations. Let F_lin consist of rational constants, addition and multiplication by
rational constants; F_nl denotes an arbitrary collection of non-linear computable
functions including transcendental functions and polynomials over the reals. We
consider the structure (ℝ, ⟨F_lin ∪ F_nl, P⟩) where P = {<, ≤, >, ≥, =, ≠} and a
set of variables V = {x_1, x_2, ..., x_n, ...}. We will use, possibly with indices, x to
denote variables and q, c, e for rational constants. Define terms, predicates and
formulas over V in the standard way. An atomic linear constraint is a formula of
the form q + c_1 x_1 + ... + c_n x_n ⋄ 0 where q, c_1, ..., c_n ∈ ℚ and ⋄ ∈ P. Negations
of atomic formulas can be eliminated by rewriting the predicate symbol ⋄ in the
standard way, hence we assume that all literals are positive. A linear constraint is
a disjunction of atomic linear constraints, also called (linear) clause. An atomic
non-linear constraint is a formula of the form f(x) ⋄ 0, where ⋄ ∈ P and f
is a composition of computable non-linear functions from F_nl over variables
x. Throughout this paper for every computable real function f we use M_f^? to
denote a function-oracle Turing machine computing f. We assume quantifier-free
formulas in separated linear form [7, Definition 1], that is, ℒ ∪ 𝒩 where ℒ is a set
of linear constraints and 𝒩 is a set of non-linear atomic constraints. Arbitrary
quantifier-free formulas can be transformed equi-satisfiably into separated linear
form in polynomial time [7, Lemma 1]. Since in separated linear form all
non-linear constraints are atomic we will call them just non-linear constraints.

Let α : V → ℚ be a partial variable assignment. The interpretation ⟦x⟧^α
of a vector of variables x under α is defined in a standard way as
component-wise application of α. Define the notation ⟦t⟧^α as evaluation of term
t under assignment α, that can be partial, in which case ⟦t⟧^α is treated
symbolically. We extend ⟦·⟧^α to predicates, clauses and CNF in the usual way and
true, false denote the constants of the Boolean domain. The evaluation ⟦t ⋄ 0⟧^α
for a predicate ⋄ and a term t results in true or false only if all variables in t
are assigned by α.
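Evaluation under a partial assignment can be sketched with a three-valued result. This is a simplified illustration with assumed names and encoding: an atomic linear constraint is a triple (q, coefficients, operator), and instead of a symbolic residue the evaluator simply returns None whenever the value is not yet determined. The example clauses encode x/4 + 1 ≤ y and y ≤ 4·(x − 1) in the form q + c_1 x_1 + ... ⋄ 0.

```python
from fractions import Fraction
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def eval_atom(atom, alpha):
    """Evaluate q + sum(c_i * x_i) <op> 0; None if a needed variable is unassigned."""
    q, coeffs, op = atom
    total = Fraction(q)
    for var, c in coeffs.items():
        if c == 0:
            continue
        if var not in alpha:
            return None
        total += Fraction(c) * alpha[var]
    return OPS[op](total, 0)


def eval_clause(clause, alpha):
    """A clause is a disjunction of atoms."""
    values = [eval_atom(a, alpha) for a in clause]
    if any(v is True for v in values):
        return True
    if all(v is False for v in values):
        return False
    return None


def eval_cnf(clauses, alpha):
    """A CNF is a conjunction of clauses."""
    values = [eval_clause(c, alpha) for c in clauses]
    if any(v is False for v in values):
        return False
    if all(v is True for v in values):
        return True
    return None


example_L = [
    [(1, {"x": Fraction(1, 4), "y": -1}, "<=")],  # x/4 + 1 <= y
    [(4, {"x": -4, "y": 1}, "<=")],               # y <= 4 * (x - 1)
]
```

With only x assigned, `eval_cnf(example_L, ...)` is undetermined; once y is assigned as well it yields true or false.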

In order to formally restate the calculus, the notions of linear resolvent and
linearisation are essential. A resolvent R_{α,ℒ,z} on a variable z is a set of linear
constraints that do not contain z, are implied by the formula ℒ and which evaluate
to false under the current partial assignment α; for more details see [19,7].


Definition 2. Let P be a non-linear constraint and let α be an assignment with
⟦P⟧^α = false. A linearisation of P at α is a linear clause C with the properties:


1. ∀β : ⟦P⟧^β = true ⟹ ⟦C⟧^β = true, and
2. ⟦C⟧^α = false.


Wlog. we can assume that the variables of C are a subset of the variables of P.
Let us note that any linear clause C represents the complement of a rational
polytope R and we will use both interchangeably. Thus for a rational polytope
R, x ∉ R also stands for a linear clause. In particular, any linearisation excludes
a rational polytope containing the conflicting assignment from the search space.
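The two properties of Definition 2 can be checked on a small example. The sketch below assumes the non-linear constraint P : y ≤ 1/x restricted to x > 0 and a midpoint-style linearisation: for a conflict (x, y) at α, pick d strictly between 1/⟦x⟧^α and ⟦y⟧^α and use the clause (x ≤ 1/d ∨ y ≤ d). The function names and the choice of P are illustrative assumptions, not the paper's implementation.

```python
from fractions import Fraction


def P(x, y):
    """Non-linear constraint y <= 1/x (only used for x > 0 in this sketch)."""
    return y <= 1 / x


def choose_d(ax, ay):
    """Midpoint between 1/x and y at the conflicting assignment."""
    return (1 / ax + ay) / 2


def C(d, x, y):
    """Candidate linearisation: the linear clause (x <= 1/d) or (y <= d)."""
    return x <= 1 / d or y <= d


ax, ay = Fraction(2), Fraction(3)   # conflict: 3 > 1/2, so P is false here
d = choose_d(ax, ay)

# Exact rational sample points with x > 0 used to spot-check property 1.
samples = [(Fraction(i, 8), Fraction(j, 8))
           for i in range(1, 65) for j in range(-32, 65)]
```

Property 2 holds because 1/⟦x⟧^α < d < ⟦y⟧^α; property 1 holds for x > 0 because x > 1/d implies y ≤ 1/x < d. The excluded region {x > 1/d, y > d} is the rational polytope mentioned above.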


Transition rules. For a formula ℒ_0 ∪ 𝒩 in separated linear form, the initial ksmt
state is (nil, ℒ_0, 𝒩). The calculus consists of the following transition rules from
a state S = (α, ℒ, 𝒩) to S′:


(A) Assignment. S′ = (α :: z ↦ q, ℒ, 𝒩) iff ⟦ℒ⟧^α ≠ false and there is a variable
z unassigned in α and q ∈ ℚ with ⟦ℒ⟧^(α::z↦q) ≠ false.

(R) Resolution. S′ = (α, ℒ ∪ R_{α,ℒ,z}, 𝒩) iff ⟦ℒ⟧^α ≠ false and there is a variable
z unassigned in α with ∀q ∈ ℚ : ⟦ℒ⟧^(α::z↦q) = false and R_{α,ℒ,z} is a resolvent.

(B) Backjump. S′ = (γ, ℒ, 𝒩) iff ⟦ℒ⟧^α = false and there is a maximal prefix γ
of α such that ⟦ℒ⟧^γ ≠ false.

(L) Linearisation. S′ = (α, ℒ ∪ {L_{α,P}}, 𝒩) iff ⟦ℒ⟧^α ≠ false, there is P in 𝒩 with
⟦P⟧^α = false and there is a linearisation L_{α,P} of P at α.

(F^sat) Final sat. S′ = sat if all variables are assigned in α, ⟦ℒ⟧^α = true and
none of the rules (A), (R), (B), (L) is applicable.

(F^unsat) Final unsat. S′ = unsat if ⟦ℒ⟧^nil = false. In other words a trivial
contradiction, e.g., 0 > 1 is in ℒ.


A path (or a run) is a derivation in ksmt. A procedure is an effective
(possibly non-deterministic) way to construct a path.


Termination. If no transition rule is applicable, the derivation terminates. For
clarity, we added the explicit rules (F^sat) and (F^unsat) which lead to the final
states. This calculus is sound [7, Lemma 2]: if the final transition is (F^sat), then
α is a solution to the original formula, or (F^unsat), then a trivial contradiction
0 > 1 was derived and the original formula is unsatisfiable. The calculus also
makes progress by reducing the search space [7, Lemma 3].
[Figure: plot of the regions defined by the example constraints together with a
table listing the successive rule applications (A), (L), (B), (R) and the
resulting assignments, ending in unsat. Linearisation of P on conflicts (x, y)
at α here: choose d := (1/⟦x⟧^α + ⟦y⟧^α)/2 and add the clause (x ≤ 1/d ∨ y ≤ d).]

Fig. 2. unsat example run of ksmt using interval linearisation [7].


An example run of the ksmt calculus is presented in Figure 2. We start in a
state with a non-linear part 𝒩 = {y ≤ 1/x}, which defines the pink area and
the linear part ℒ = {(x/4 + 1 ≤ y), (y ≤ 4 · (x − 1))}, shaded in green. Then we
successively apply ksmt rules excluding regions around candidate solutions by
linearisations, until we derive linearisations which separate the pink area from
the green area thus deriving a contradiction.


Remark 1. In general a derivation may not terminate. The only cause of
non-termination is the linearisation rule which adds new linear constraints and can
be applied infinitely many times. To see this, observe that ksmt with only the
rules (A), (R), (B) corresponds to the conflict resolution calculus which is known
to be terminating [19,20]. Thus, in infinite ksmt runs the linearisation rule (L) is
applied infinitely often. This argument is used in the proof of Theorem 1 below.
Let us note that during a run of the ksmt calculus neither conflicts nor lemmas
can be generated more than once. In fact, any generated linearisation is not
implied by the linear part, prior to adding this linearisation.


3.1 Sufficient Termination Conditions 


In this section we will assume that (α, ℒ, 𝒩) is a ksmt state obtained by applying
ksmt inference rules to an initial state. As in [13] we only consider bounded
instances. In many applications this is a natural assumption as variables usually
range within some (possibly large) bounds. We can assume that these bounds
are made explicit as linear constraints in the system.


Definition 3. Let $F$ be the formula $\mathcal{L}_0 \wedge \mathcal{N}$ in separated linear form over
variables $x_1, \ldots, x_n$ and let $B_i$ be the set defined by the conjunction of all clauses
in $\mathcal{L}_0$ univariate in $x_i$, for $i = 1, \ldots, n$; in particular, if there are no univariate
linear constraints over $x_i$ then $B_i = \mathbb{R}$. We call $F$ a bounded instance if:


– $D_F := \times_{i=1}^{n} B_i$ is bounded, and
– for each non-linear constraint $P : f(x_{i_1}, \ldots, x_{i_k}) \diamond 0$ in $\mathcal{N}$ with $i_j \in \{1, \ldots, n\}$
for $j \in \{1, \ldots, k\}$ it holds that $\overline{D_P} \subseteq \operatorname{dom} f$, where $D_P := \times_{j=1}^{k} B_{i_j}$.
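To make the first condition concrete, here is a minimal Python sketch of how the per-variable sets $B_i$ could be collected and tested for boundedness. It is an illustration only: the function names and the tuple encoding of constraints are assumptions, it handles only unit constraints of the form `x op c` (not general univariate clauses), and it ignores the difference between strict and non-strict bounds.

```python
from math import inf

def variable_boxes(variables, univariate):
    """Intersect all univariate bounds per variable.

    `univariate` is a list of (variable, op, bound) tuples (hypothetical
    encoding). Variables without any bound keep (-inf, inf), i.e. B_i = R.
    """
    boxes = {v: (-inf, inf) for v in variables}
    for var, op, bound in univariate:
        lo, hi = boxes[var]
        if op in ("<", "<="):
            hi = min(hi, bound)      # upper bound on var
        elif op in (">", ">="):
            lo = max(lo, bound)      # lower bound on var
        else:
            raise ValueError(f"unsupported operator: {op}")
        boxes[var] = (lo, hi)
    return boxes

def is_bounded(boxes):
    """D_F is bounded iff every B_i has a finite lower and upper bound."""
    return all(lo > -inf and hi < inf for lo, hi in boxes.values())
```

For example, bounding `x` on both sides but `y` only from above leaves the product unbounded; adding a lower bound on `y` makes it bounded.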


By this definition, already the linear part of bounded instances explicitly defines 
a bounded set by univariate constraints. Consequently, the set of solutions of F 
is bounded as well. 

In Theorem 1 we show that when we consider bounded instances and restrict
linearisations to so-called ε-full linearisations, then the procedure terminates.
We use this to show that the ksmt-based decision procedure we introduce in
Section 5 is δ-complete.


Definition 4. Let $\epsilon > 0$, $P$ be a non-linear constraint over variables $x$ and let
$\alpha$ be an assignment of $x$. A linearisation $C$ of $P$ at $\alpha$ is called ε-full iff for all
assignments $\beta$ of $x$ with $[\![x]\!]^{\beta} \in B([\![x]\!]^{\alpha}, \epsilon)$, $[\![C]\!]^{\beta} = \text{false}$.

A ksmt run is called ε-full for some $\epsilon > 0$ if all but finitely many
linearisations in this run are ε-full.


The next theorem provides a basis for termination of ksmt-based decision 
procedures for satisfiability. 


Theorem 1. Let $\epsilon > 0$. On bounded instances, ε-full ksmt runs are terminating.


Proof. Let $F : \mathcal{L}_0 \wedge \mathcal{N}$ be a bounded instance and $\epsilon > 0$. Towards a
contradiction assume there is an infinite ε-full derivation
$(\alpha_0, \mathcal{L}_0, \mathcal{N}), \ldots, (\alpha_n, \mathcal{L}_n, \mathcal{N}), \ldots$
in the ksmt calculus. Then, by definition of the transition rules, $\mathcal{L}_k \subseteq \mathcal{L}_l$ for
all $k, l$ with $0 \leq k \leq l$. According to Remark 1, in any infinite derivation the
linearisation rule must be applied infinitely many times. During any run of ksmt
the set of non-linear constraints $\mathcal{N}$ is fixed and therefore there is a non-linear
constraint $P$ in $\mathcal{N}$ over variables $x$ to which linearisation is applied infinitely
often. Let $(\alpha_{i_1}, \mathcal{L}_{i_1}, \mathcal{N}), \ldots, (\alpha_{i_j}, \mathcal{L}_{i_j}, \mathcal{N}), \ldots$ be a corresponding subsequence
in the derivation such that $C_{i_1} \in \mathcal{L}_{i_1+1}, \ldots, C_{i_j} \in \mathcal{L}_{i_j+1}, \ldots$ are ε-full
linearisations of $P$. Consider two different linearisation steps $k, l \in \{i_j : j \in \mathbb{N}\}$ in
the derivation where $k < l$. By the precondition of rule (L) applied in step $l$
we have $[\![\mathcal{L}_l]\!]^{\alpha_l} \neq \text{false}$. In particular the linearisation $C_k \in \mathcal{L}_{k+1} \subseteq \mathcal{L}_l$ of $P$
constructed in step $k$ does not evaluate to false under $\alpha_l$. Since the set of
variables in $C_k$ is a subset of those in $P$, $[\![C_k]\!]^{\alpha_l} \neq \text{false}$ implies $[\![C_k]\!]^{\alpha_l} = \text{true}$.
By assumption, the linearisation $C_k$ is ε-full, thus from Definition 4 it follows
that $[\![x]\!]^{\alpha_l} \notin B([\![x]\!]^{\alpha_k}, \epsilon)$. Therefore the distance between $[\![x]\!]^{\alpha_k}$ and $[\![x]\!]^{\alpha_l}$ is
at least $\epsilon$. However, every conflict satisfies the variable bounds defining $D_P$, so
there can be only finitely many conflicts with pairwise distance at least $\epsilon$. This
contradicts the above.
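The final step of the proof is a packing argument: a bounded box can hold only finitely many points that are pairwise at least ε apart. The following Python sketch makes the bound explicit. It assumes the Euclidean norm (the source does not fix the norm of the ball B here), and the function names are hypothetical. Half-open grid cells of side ε/√n have diameter below ε, so each cell holds at most one such point.

```python
import math

def packing_bound(widths, eps):
    """Upper bound on the number of points with pairwise Euclidean
    distance >= eps inside a box with the given side widths."""
    n = len(widths)
    side = eps / math.sqrt(n)   # half-open cells of this side have diameter < eps
    return math.prod(math.floor(w / side) + 1 for w in widths)

def greedy_separated(points, eps):
    """Keep each point that is at least eps away from all points kept so far,
    mimicking conflicts that avoid all previously excluded eps-balls."""
    chosen = []
    for p in points:
        if all(math.dist(p, q) >= eps for q in chosen):
            chosen.append(p)
    return chosen
```

Feeding any candidate sequence from the box into `greedy_separated` yields a list no longer than `packing_bound`, which is the finiteness the proof relies on.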


Concrete algorithms to compute ε-full linearisations are presented in
Sections 5 and 6.

Fig. 3. The overlapping cases (δ-sat, unsat) in the δ-SMT problem f(x) < 0.


4 δ-decidability


In the last section, we proved termination of the ksmt calculus on bounded
instances when linearisations are ε-full. Let us now investigate how ε-full
linearisations of constraints involving non-linear computable functions can be
constructed. To that end, we assume that all non-linear functions are defined on the
closure of the bounded space $D_F$ defined by the bounded instance $F$.

So far we described an approach which gives exact results but at the same
time is necessarily incomplete due to undecidability of non-linear constraints in
general. On the other hand, non-linear constraints usually can be approximated
using numerical methods, which makes it possible to obtain approximate solutions
to the problem. This gives rise to the bounded δ-SMT problem [13], which allows an overlap
between the properties δ-sat and unsat of formulas, as illustrated by Figure 3. It
is precisely this overlap that enables δ-decidability of bounded instances.

Let us recall the notion of δ-decidability, adapted from [13].


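Definition 5. Let $F$ be a formula in separated linear form and let $\delta \in \mathbb{Q}_{>0}$. We
inductively define the δ-weakening $F_{\delta}$ of $F$.


– If $F$ is linear, let $F_{\delta} := F$.
– If $F$ is a non-linear constraint $f(x) \diamond 0$, let

\[
F_{\delta} :=
\begin{cases}
f(x) - \delta \diamond 0, & \text{if } \diamond \in \{<, \leq\} \\
f(x) + \delta \diamond 0, & \text{if } \diamond \in \{>, \geq\} \\
|f(x)| - \delta \leq 0, & \text{if } \diamond \in \{=\} \\
(f(x) < 0 \vee f(x) > 0)_{\delta}, & \text{if } \diamond \in \{\neq\}.
\end{cases}
\]

– Otherwise, $F$ is $A \circ B$ with $\circ \in \{\wedge, \vee\}$. Let $F_{\delta} := (A_{\delta} \circ B_{\delta})$.

δ-deciding $F$ designates computing

\[
\begin{cases}
\text{unsat}, & \text{if } [\![F]\!]^{\alpha} = \text{false for all } \alpha \\
\delta\text{-sat}, & \text{if } [\![F_{\delta}]\!]^{\alpha} = \text{true for some } \alpha.
\end{cases}
\]

In case both answers are valid, the algorithm may output any.
An assignment $\alpha$ with $[\![F_{\delta}]\!]^{\alpha} = \text{true}$ we call a δ-satisfying assignment for $F$.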


For non-linear constraints $P$ this definition of the δ-weakening $P_{\delta}$ corresponds
exactly to the notion of δ-weakening $P^{-\delta}$ used in the introduction of δ-decidability
[14, Definition 4.1].


Remark 2. The δ-weakening of a non-linear constraint $f(x) \neq 0$ is a tautology.
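The case distinction of Definition 5 for a single non-linear constraint can be sketched in a few lines of Python. This is only an illustration under the assumption that the value of f(x) is available exactly as a rational; the function names and the string encoding of the predicates are hypothetical.

```python
from fractions import Fraction

def holds(value, op):
    """Truth value of `value op 0` for an exact rational value."""
    return {
        "<": value < 0, "<=": value <= 0,
        ">": value > 0, ">=": value >= 0,
        "=": value == 0, "!=": value != 0,
    }[op]

def weakened_holds(value, op, delta):
    """Truth value of the delta-weakening of `f(x) op 0`, where value = f(x)."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    if op in ("<", "<="):
        return holds(value - delta, op)
    if op in (">", ">="):
        return holds(value + delta, op)
    if op == "=":
        return abs(value) - delta <= 0
    if op == "!=":
        # weakening of (f(x) < 0 or f(x) > 0); always true for delta > 0
        return weakened_holds(value, "<", delta) or weakened_holds(value, ">", delta)
    raise ValueError(f"unsupported operator: {op}")
```

With δ = 1/4, the value f(x) = 1/10 violates `f(x) <= 0` but satisfies its δ-weakening, and the `!=` case returns true for every value, in line with Remark 2.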


We now consider the problem of δ-deciding quantifier-free formulas in
separated linear form. The notion of δ-decidability is slightly stronger than in [13]
in the sense that we do not weaken linear constraints. Consider a formula $F$ in
separated linear form. As before, we assume variables $x$ to be bounded by linear
constraints $x \in D_F$. We additionally assume that for all non-linear constraints
$P : f(x) \diamond 0$ in $\mathcal{N}$, $f$ is defined on $\overline{D_P}$. In order to simplify the presentation,
throughout the rest of the paper we assume that only the predicates $\diamond \in \{>, \geq\}$ are
part of formulas, since the remaining ones $<, \leq, =$ can easily be expressed by the
former using simple arithmetic transformations, and by Remark 2 predicates $\neq$
are irrelevant for δ-deciding formulas.

An algorithm is δ-complete if it δ-decides bounded instances [13].


5 δ-ksmt


Since δ-decidability as introduced above adapts the condition when a formula
is considered to be satisfied to δ-sat, this condition has to be reflected in the
calculus, which we show solves the bounded δ-SMT problem in this section.
Adding the following rule $(F^{sat}_{\delta})$ together with the new final state δ-sat to ksmt
relaxes the termination conditions and turns it into the extended calculus we
call δ-ksmt.


$(F^{sat}_{\delta})$ Final δ-sat. If $(\alpha, \mathcal{L}, \mathcal{N})$ is a δ-ksmt state where $\alpha$ is a total assignment
and $[\![\mathcal{L} \wedge \mathcal{N}_{\delta}]\!]^{\alpha} = \text{true}$, transition to the δ-sat state.


The applicability conditions on the rules (L) and $(F^{sat}_{\delta})$ individually are not
decidable [27,5]; however, when we compute them simultaneously, we can
effectively apply one of these rules, as we will show in Lemma 3. In combination with
ε-fullness of the computed linearisations (Lemma 4), this leads to Theorem 3,
showing that δ-ksmt is a δ-complete decision procedure.

Let us note that if we assume $\delta = 0$ then δ-ksmt would just reduce to ksmt
as $(F^{sat})$ and $(F^{sat}_{\delta})$ become indistinguishable, but in the following we always
assume $\delta > 0$.

In the following sub-section, we prove that terminating derivations of the
δ-ksmt calculus lead to correct results. Then, in Section 5.2, we present a concrete
algorithm for applying rules (L) and $(F^{sat}_{\delta})$ and show its linearisations to be
ε-full, which is sufficient to ensure termination, as shown in Theorem 1. These
properties lead to a δ-complete decision procedure. In Section 6 we develop a
more practical algorithm for ε-full linearisations that does not require computing
a uniform modulus of continuity.


5.1 Soundness 


In this section we show soundness of the δ-ksmt calculus, that is, validity of
its derivations. In particular, this implies that derivability of the final states
unsat, δ-sat and sat directly corresponds to unsatisfiability, δ-satisfiability and
satisfiability of the original formula, respectively.


Lemma 1. For all δ-ksmt derivations of $S' = (\alpha', \mathcal{L}', \mathcal{N})$ from a state
$S = (\alpha, \mathcal{L}, \mathcal{N})$ and for all total assignments $\beta$, $[\![\mathcal{L} \wedge \mathcal{N}]\!]^{\beta} = [\![\mathcal{L}' \wedge \mathcal{N}]\!]^{\beta}$.


Proof. Let $\beta$ be a total assignment of the variables in $\mathcal{L} \wedge \mathcal{N}$. Since the set of
variables remains unchanged by δ-ksmt derivations, $\beta$ is a total assignment for
$\mathcal{L}' \wedge \mathcal{N}$ as well. Let $S' = (\alpha', \mathcal{L}', \mathcal{N})$ be derived from $S = (\alpha, \mathcal{L}, \mathcal{N})$ by a single
application of one of the δ-ksmt rules. By the structure of $S'$, its derivation was
caused by none of $(F^{unsat})$, $(F^{sat})$ or $(F^{sat}_{\delta})$. For rules (A) and (B) there
is nothing to show since $\mathcal{L} = \mathcal{L}'$. If (R) caused $S \mapsto S'$, the claim holds by
soundness of arithmetical resolution. Otherwise (L) caused $S \mapsto S'$, in which
case the direction $\Rightarrow$ follows from the definition of a linearisation (condition 1
in Definition 2) while the other direction trivially holds since $\mathcal{L} \subseteq \mathcal{L}'$.

The condition on derivations of arbitrary lengths then follows by induction.


Lemma 2. Let $\delta \in \mathbb{Q}_{>0}$. Consider a formula $G = \mathcal{L}_0 \wedge \mathcal{N}$ in separated linear
form and let $S = (\alpha, \mathcal{L}, \mathcal{N})$ be a δ-ksmt state derivable from the initial state
$S_0 = (\text{nil}, \mathcal{L}_0, \mathcal{N})$. The following hold.


– If rule $(F^{unsat})$ is applicable to $S$ then $G$ is unsatisfiable.

– If rule $(F^{sat}_{\delta})$ is applicable to $S$ then $\alpha$ is a δ-satisfying assignment for $G$,
hence $G$ is δ-satisfiable.

– If rule $(F^{sat})$ is applicable to $S$ then $\alpha$ is a satisfying assignment for $G$,
hence $G$ is satisfiable.


Proof. Let formula $G$ and states $S_0$, $S$ be as in the premise. As $S$ is not final
in δ-ksmt, only ksmt rules have been applied in deriving it. The statements for
rules $(F^{unsat})$ and $(F^{sat})$ thus hold by soundness of ksmt [7, Lemma 2].

Assume $(F^{sat}_{\delta})$ is applicable to $S$, that is, $[\![\mathcal{L} \wedge \mathcal{N}_{\delta}]\!]^{\alpha}$ is true. Then, since
$\mathcal{L}_0 \subseteq \mathcal{L}$, we conclude that $\alpha$ satisfies $\mathcal{L}_0 \wedge \mathcal{N}_{\delta}$ which, according to Definition 5,
equals $G_{\delta}$. Therefore $\alpha$ is a δ-satisfying assignment for $G$.


Since the only way to derive one of the final states unsat, δ-sat and sat from the
initial state in δ-ksmt is by application of the rule $(F^{unsat})$, $(F^{sat}_{\delta})$ and $(F^{sat})$,
respectively, as a corollary of Lemmas 1 and 2 we obtain soundness.


Theorem 2 (Soundness). Let $\delta \in \mathbb{Q}_{>0}$. The δ-ksmt calculus is sound.


5.2 δ-completeness


We proceed by introducing Algorithm 1 computing linearisations and deciding
which of the rules $(F^{sat}_{\delta})$ and (L) to apply. These linearisations are then shown
to be ε-full for some $\epsilon > 0$ depending on the bounded instance. By Theorem 1,
this property implies termination, showing that δ-ksmt is a δ-complete decision
procedure.

Given a non-final δ-ksmt state, the function NLINSTEP$_{\delta}$ in Algorithm 1
computes a δ-ksmt state derivable from it by application of $(F^{sat}_{\delta})$ or (L). This is
done by evaluating the non-linear functions and adding a linearisation $\ell$ based
on their uniform moduli of continuity as needed. To simplify the algorithm, it
assumes total assignments as input. It is possible to relax this requirement, e.g.,
by invoking rules (A) or (R) instead of returning δ-sat for partial assignments.
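

Algorithm 1 (NLINSTEP$_{\delta}$) Algorithm computing a δ-ksmt derivation according
to either rule (L) or $(F^{sat}_{\delta})$ from a state $(\alpha, \mathcal{L}, \mathcal{N})$ where $\alpha$ is total. The functions
$f$ are assumed to be computed by machines $M_f^?$ and $\mu_f$ to be a computable
uniform modulus of continuity of $f$.


function LINEARISE$_{\delta}$($f$, $x$, $\diamond$, $\alpha$)
    compute $p \geq -\lfloor \log_2(\min\{1, \delta/4\}) \rfloor$
    $\varphi \leftarrow (n \mapsto [\![x]\!]^{\alpha})$
    $\epsilon \leftarrow 2^{-\mu_f(p)}$
    $\tilde{y} \leftarrow M_f^{\varphi}(p)$
    if $\tilde{y} \diamond -\delta/2$ then
        return None
    end if
    return $(x \notin B([\![x]\!]^{\alpha}, \epsilon))$
end function


function NLINSTEP$_{\delta}$($\alpha$, $\mathcal{L}$, $\mathcal{N}$)
    for $P : (f(x) \diamond 0)$ in $\mathcal{N}$ do
        $\ell \leftarrow$ LINEARISE$_{\delta}$($f$, $x$, $\diamond$, $\alpha$)
        if $\ell \neq$ None then
            return $(\alpha, \mathcal{L} \cup \{\ell\}, \mathcal{N})$    ▷ (L)
        end if
    end for
    return δ-sat    ▷ $(F^{sat}_{\delta})$
end function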
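The control flow of Algorithm 1 can be mirrored in a small Python sketch. This is not the authors' implementation: the machine $M_f^?$ is replaced by a callable `f_approx(point, p)` that is assumed to return a rational within $2^{-p}$ of f, the modulus is passed in as a callable `mu`, linearisations are represented symbolically as tuples, and all names are hypothetical. The toy constraint $1 - x^2 \geq 0$ on $[0, 4]$ is evaluated exactly, and `mu(p) = p + 3` is valid there because $|f(x) - f(y)| \leq 8|x - y|$.

```python
from fractions import Fraction

def precision_for(delta):
    """Smallest natural p with 2**-p <= min(1, delta/4)."""
    target = min(Fraction(1), Fraction(delta) / 4)
    p = 0
    while Fraction(1, 2 ** p) > target:
        p += 1
    return p

def satisfies(value, op, bound):
    """Evaluate `value op bound` for the two predicates kept in the paper."""
    if op not in (">", ">="):
        raise ValueError(f"unsupported operator: {op}")
    return value > bound if op == ">" else value >= bound

def linearise(f_approx, mu, op, point, delta):
    """Return None if the delta-weakening is accepted at `point`, otherwise a
    symbolic linearisation excluding the ball of radius 2**-mu(p)."""
    p = precision_for(delta)
    eps = Fraction(1, 2 ** mu(p))
    y = f_approx(point, p)              # assumed accurate up to 2**-p
    if satisfies(y, op, -Fraction(delta) / 2):
        return None
    return ("exclude_ball", point, eps)

def nlin_step(point, linear, nonlinear, delta):
    """One step: add the first linearisation found, else report delta-sat."""
    for f_approx, mu, op in nonlinear:
        lin = linearise(f_approx, mu, op, point, delta)
        if lin is not None:
            return (point, linear + [lin], nonlinear)
    return "delta-sat"

def toy_f(point, p):
    """Toy constraint function f(x) = 1 - x^2, evaluated exactly (error 0)."""
    (x,) = point
    return 1 - x * x

def toy_mu(p):
    """Uniform modulus of continuity of f on [0, 4] (Lipschitz constant 8)."""
    return p + 3
```

With δ = 1, the assignment x = 2 produces an excluded ball of radius 1/32, while x = 1 is accepted as δ-satisfying.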


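Lemma 3. Let $\delta \in \mathbb{Q}_{>0}$ and let $S = (\alpha, \mathcal{L}, \mathcal{N})$ be a δ-ksmt state where $\alpha$ is
total and $[\![\mathcal{L}]\!]^{\alpha} = \text{true}$. Then NLINSTEP$_{\delta}$($\alpha$, $\mathcal{L}$, $\mathcal{N}$) computes a state derivable by
application of either (L) or $(F^{sat}_{\delta})$ to $S$.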


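Proof. In the proof we will use notions from computable analysis, as defined
in Section 2. Let $(\alpha, \mathcal{L}, \mathcal{N})$ be a state as in the premise and let $P : f(x) \diamond 0$
be a non-linear constraint in $\mathcal{N}$. Let $M_f^?$ compute $f$ as in Algorithm 1. The
algorithm computes a rational approximation $\tilde{y} = M_f^{\varphi}(p)$ of $f([\![x]\!]^{\alpha})$, with $\varphi$ the
constant name used in Algorithm 1, where
$p \geq -\lfloor \log_2(\min\{1, \delta/4\}) \rfloor \in \mathbb{N}$. $[\![\mathcal{L}]\!]^{\alpha} = \text{true}$ implies $[\![x]\!]^{\alpha} \in D_P \subseteq \operatorname{dom} f$, thus
the computation of $\tilde{y}$ terminates. Since $M_f^?$ computes $f$, $\tilde{y}$ is accurate up to
$2^{-p} \leq \delta/4$, that is, $\tilde{y} \in [f([\![x]\!]^{\alpha}) \pm \delta/4]$. By assumption $\diamond \in \{>, \geq\}$, thus
1. $\tilde{y} \diamond -\delta/2$ implies $f([\![x]\!]^{\alpha}) \diamond -\delta$, which is equivalent to $[\![P_{\delta}]\!]^{\alpha} = \text{true}$, and
2. $\neg(\tilde{y} \diamond -\delta/2)$ implies $\neg(f([\![x]\!]^{\alpha}) \diamond -\delta/2 + \delta/4)$, which in turn implies
$[\![P]\!]^{\alpha} = \text{false}$ and the applicability of rule (L).


For Item 1 no linearisation is necessary and indeed the algorithm does not
linearise $P$. Otherwise (Item 2), it adds the linearisation $(x \notin B([\![x]\!]^{\alpha}, \epsilon))$ to the
linear clauses. Since $[\![x]\!]^{\alpha} \in D_P$, by Eq. (2.1) we obtain that $0 \notin B(f(z), \delta/4)$
holds, implying $\neg(f(z) \diamond 0)$, for all $z \in B([\![x]\!]^{\alpha}, \epsilon) \cap D_P$. Hence, $(x \notin B([\![x]\!]^{\alpha}, \epsilon))$
is a linearisation of $P$ at $\alpha$.

In case NLINSTEP$_{\delta}$($\alpha$, $\mathcal{L}$, $\mathcal{N}$) returns δ-sat, the premise of Item 1 holds for
every non-linear constraint in $\mathcal{N}$, that is, $[\![\mathcal{N}_{\delta}]\!]^{\alpha} = \text{true}$. By assumption
$[\![\mathcal{L}]\!]^{\alpha} = \text{true}$, hence the application of the $(F^{sat}_{\delta})$ rule deriving δ-sat is possible in δ-ksmt.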


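Lemma 4. For any bounded instance $\mathcal{L}_0 \wedge \mathcal{N}$ there is a computable $\epsilon \in \mathbb{Q}_{>0}$
such that any δ-ksmt run starting in $(\text{nil}, \mathcal{L}_0, \mathcal{N})$, where applications of (L) and
$(F^{sat}_{\delta})$ are performed by NLINSTEP$_{\delta}$, is ε-full.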


Proof. Let $P : f(x) \diamond 0$ be a non-linear constraint in $\mathcal{N}$. Since $\mathcal{L}_0 \wedge \mathcal{N}$ is a
bounded instance, $D_P \subseteq \mathbb{R}^n$ is also bounded. Let $\epsilon_P := 2^{-\mu_f(p)}$ where
$p \geq -\lfloor \log_2(\min\{1, \delta/4\}) \rfloor \in \mathbb{N}$ as in Algorithm 1. As $\mu_f$ is a uniform modulus of
continuity, the inequalities in the following construction hold on the whole domain
$D_P$ of $f$ and do not depend on the concrete assignment $\alpha$ where the linearisation
is performed. Since $\log_2$ and $\mu_f$ are computable, so are $p$ and $\epsilon_P$. There
are finitely many non-linear constraints $P$ in $\mathcal{N}$, therefore the linearisations the
algorithm NLINSTEP$_{\delta}$ computes are ε-full with $\epsilon = \min\{\epsilon_P : P \text{ in } \mathcal{N}\} > 0$.


We call δ-ksmt derivations in which linearisations are computed using
Algorithm 1 δ-ksmt with full-box linearisations, or δ-ksmt-fb for short. As the runs
computed by it are ε-full for some $\epsilon > 0$, by Theorem 1 they terminate.


Theorem 3. δ-ksmt-fb is a δ-complete decision procedure.


Proof. δ-ksmt-fb is sound (Theorem 2) and terminates on bounded instances
(Theorem 1 and Lemma 4).


6 Local ε-full Linearisations


In practice, when the algorithm computing ε-full linearisations described in the
previous section is implemented, the question arises of how to get a
good uniform modulus of continuity $\mu_f$ for a computable function $f$. Depending
on how $f$ is given, there may be several ways of computing it. Implementations
of exact real arithmetic, e.g., iRRAM [24] and Ariadne [2], are usually based on
the formalism of function-oracle Turing machines (see Definition 1), which allow
computing with representations of computable functions [10], including implicit
representations of functions as solutions of ODEs/PDEs [26,9]. If $f$ is only
available as a function-oracle Turing machine $M_f^?$ computing it, a modulus $\mu_f$ valid
on a compact domain can be computed; however, in general this is not possible
without exploring the behaviour of the function on the whole domain, which in
many cases is computationally expensive. Moreover, since $\mu_f$ is uniform, $\mu_f(n)$
is constant throughout $D_P$, independent of the actual assignment $\alpha$ determining
where $f$ is evaluated. Yet, computable functions admit local moduli of continuity
that additionally depend on the concrete point in their domain. In most cases
these would provide linearisations with $\epsilon$ larger than that determined by $\mu_f$,
leading to larger regions being excluded, ultimately resulting in fewer linearisation
steps and a general speed-up. Indeed, machines producing finite approximations
of $f(x)$ from finite approximations of $x$ internally have to compute some form of
local modulus to guarantee correctness. In this section, we explore this approach
of obtaining linearisations covering a larger part of the function's domain.

In order to guarantee a positive bound on the local modulus of continuity
extracted directly from the run of the machine $M_f^?$ computing $f$, it is necessary
to employ a restriction on the names of real numbers $M_f^?$ computes on.
The set of names should in a very precise sense be "small", i.e., it has to be
compact. The very general notion of names used in Definition 1 is too broad to
satisfy this criterion since the space of rational approximations is not even locally
compact. Here, we present an approach using practical names of real numbers as
sequences of dyadic rationals of lengths restricted by accuracy. For that purpose,
we introduce another representation [31] of $\mathbb{R}$, that is, the surjective mapping
$\xi : \mathbb{D}_{\omega} \to \mathbb{R}$. Here, $\mathbb{D}_{\omega}$ denotes the set of infinite sequences $\varphi$ of dyadic rationals
with bounded length. If $\varphi$ has a limit (in $\mathbb{R}$), we write $\lim \varphi$.


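Definition 6.
– For $k \in \omega$ let $\mathbb{D}_k := \mathbb{Z} \cdot 2^{-(k+1)} = \{m/2^{k+1} : m \in \mathbb{Z}\} \subset \mathbb{Q}$
and let $\mathbb{D}_{\omega} := \times_{k \in \omega} \mathbb{D}_k$ be the set of all sequences $(\varphi_k)_k$ with $\varphi_k \in \mathbb{D}_k$ for
all $k \in \omega$. By default, $\mathbb{D}_{\omega}$ is endowed with the Baire space topology, which
corresponds to that induced by the metric

\[
(\varphi, \psi) \mapsto
\begin{cases}
0 & \text{if } \varphi = \psi \\
1/\min\{1 + n : n \in \omega, \varphi_n \neq \psi_n\} & \text{otherwise.}
\end{cases}
\]

– Define $\xi : \mathbb{D}_{\omega} \to \mathbb{R}$ as the partial function mapping $\varphi \in \mathbb{D}_{\omega}$ to $\lim \varphi$ iff
$\forall i, j : |\varphi_i - \varphi_{i+j}| \leq 2^{-(i+1)}$. Any $\varphi \in \xi^{-1}(x)$ is called a ξ-name of $x \in \mathbb{R}$.

– The representation $\rho : (x_k)_k \mapsto x$ mapping names $(x_k)_k$ of $x \in \mathbb{R}$ to $x$ as per
Definition 1 is called Cauchy representation.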


Using a standard product construction we can easily generalise the notion of
ξ-names to $\xi^n$-names of $\mathbb{R}^n$. When clear from the context, we will drop $n$ and
just write $\xi$ to denote the corresponding generalised representation $\mathbb{D}_{\omega}^n \to \mathbb{R}^n$.

Computable equivalence between two representations not only implies that
there are continuous maps between them but also that names can computably
be transformed [31]. Since the Cauchy representation itself is continuous [4] we
derive continuity of $\xi$, which is used below to show compactness of preimages
$\xi^{-1}(X)$ of compact sets $X \subseteq \mathbb{R}$ under $\xi$. All proofs can be found in [8].


Lemma 5. The following properties hold for $\xi$.


1. $\xi$ is a representation of $\mathbb{R}^n$: it is well-defined and surjective.
2. Any ξ-name of $x \in \mathbb{R}^n$ is a Cauchy-name of $x$.
3. $\xi$ is computably equivalent to the Cauchy representation.
4. $\xi$ is continuous.


The converse of Item 2 does not hold. An example for a Cauchy-name of $0 \in \mathbb{R}$
is the sequence $(x_n)_n$ with $x_n = (-2)^{-n}$ for all $n \in \omega$, which does not satisfy
$\forall i, j : |x_i - x_{i+j}| \leq 2^{-(i+1)}$. However, given a name of a real number, we can
compute a corresponding ξ-name; this is one direction of the property in Item 3.

As a consequence of Item 2, a function-oracle machine $M_f^?$ computing
$f : \mathbb{R}^n \to \mathbb{R}$ according to Definition 1 can be run on ξ-names of $x \in \mathbb{R}^n$, leading
to valid Cauchy-names of $f(x)$. Note that this proposition does not require
$M_f^?$ to compute a ξ-name of $f(x)$. Any rational sequence rapidly converging
to $f(x)$ is a valid output. This means that the model of computation remains
unchanged with respect to the earlier parts of this paper. It is the set of names the
machines are operated on which is restricted. This is reflected in Algorithm 2
by computing dyadic rational approximations $\tilde{x}_k$ of $[\![x]\!]^{\alpha}$ such that $\tilde{x}_k \in \mathbb{D}_k^n$
instead of keeping the name of $[\![x]\!]^{\alpha}$ constant as has been done in Algorithm 1.


Algorithm 2 (Local linearisation) Algorithm δ-deciding $P : f(x) \diamond 0$ and,
in case unsat, computing a linearisation at $\alpha$, or returning "None", in which
case $\alpha$ satisfies $P_{\delta}$. The function $f$ is computed by machine $M_f^?$.


function LINEARISELOCAL$_{\delta}$($f$, $x$, $\diamond$, $\alpha$)
    $\varphi \leftarrow (m \mapsto \operatorname{approx}([\![x]\!]^{\alpha}, m))$    ▷ then $\varphi$ is a ξ-name of $[\![x]\!]^{\alpha}$
    compute $p \geq -\lfloor \log_2(\min\{1, \delta/4\}) \rfloor$
    run $M_f^{\varphi}(p + 2)$, record its output $\tilde{y}$ and its maximum query $k \in \omega$ to $\varphi$
    if $\tilde{y} \diamond -\delta/2$ then
        return None
    else
        return $(x \notin B([\![x]\!]^{\alpha}, 2^{-k}))$
    end if
end function
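The idea of observing oracle queries can be illustrated with a short Python sketch. It is a simplification under several assumptions: a single rational variable, a machine modelled as a callable `machine(oracle, q)` that returns an approximation within $2^{-q}$, and a symbolic tuple standing in for the linearisation. All names are hypothetical. The toy machine for $f(x) = 1 - x^2$ is only valid for $|x| \leq 4$, where querying index q + 2 suffices because the oracle value is then within $2^{-(q+4)}$ of x and $|\tilde{x} + x| \leq 9$.

```python
from fractions import Fraction

def precision_for(delta):
    """Smallest natural p with 2**-p <= min(1, delta/4)."""
    target = min(Fraction(1), Fraction(delta) / 4)
    p = 0
    while Fraction(1, 2 ** p) > target:
        p += 1
    return p

def approx(x, m):
    """Dyadic approximation of x with denominator 2**(m+1), by rounding."""
    scale = 2 ** (m + 1)
    return Fraction(round(Fraction(x) * scale), scale)

def linearise_local(machine, op, x, delta):
    """Run the machine on the name m -> approx(x, m), recording the largest
    index it queries; that index determines the radius of the excluded ball."""
    if op not in (">", ">="):
        raise ValueError(f"unsupported operator: {op}")
    max_query = 0

    def oracle(m):
        nonlocal max_query
        max_query = max(max_query, m)
        return approx(x, m)

    p = precision_for(delta)
    y = machine(oracle, p + 2)
    threshold = -Fraction(delta) / 2
    accepted = y > threshold if op == ">" else y >= threshold
    if accepted:
        return None
    return ("exclude_ball", x, Fraction(1, 2 ** max_query))

def toy_machine(oracle, q):
    """Approximates f(x) = 1 - x^2 within 2**-q, assuming |x| <= 4."""
    xt = oracle(q + 2)
    return 1 - xt * xt
```

With δ = 1 the machine is called with precision 4 and queries index 6, so the excluded ball at x = 2 has radius 1/64 in this toy setting; at x = 1 the weakened constraint is accepted and no linearisation is produced.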


In particular, in Theorem 4 we show that linearisations for the (L) rule can
be computed by Algorithm 2, which, in contrast to LINEARISE$_{\delta}$ in Algorithm 1,
does not require access to a procedure computing an upper bound $\mu_f$ on the
uniform modulus of continuity of the non-linear function $f$ valid on the
entire bounded domain. It not only runs the machine $M_f^?$, but also observes the
queries $M_f^{\varphi}$ poses to its oracle in order to obtain a local modulus of continuity of
$f$ at the point of evaluation. The function $\operatorname{approx}(x, m) := \lfloor x \cdot 2^{m+1} \rceil / 2^{m+1}$ used
to define Algorithm 2 computes a dyadic approximation of $x$, with $\lfloor \cdot \rceil : \mathbb{Q}^n \to \mathbb{Z}^n$
denoting a rounding operation, that is, it satisfies $\forall q : \|\lfloor q \rceil - q\| \leq \frac{1}{2}$. On
rationals (our use-case), $\lfloor \cdot \rceil$ is computable by a classical Turing machine.
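A minimal Python sketch of the rounding-based approximation, together with a check of the ξ-name condition $|\varphi_i - \varphi_{i+j}| \leq 2^{-(i+1)}$ on finite prefixes, may help to see why these names are admissible. It is restricted to a single rational input, uses Python's built-in `round` as the rounding operation (which satisfies the required bound of 1/2), and the function names are hypothetical.

```python
from fractions import Fraction

def approx(x, m):
    """Nearest multiple of 2**-(m+1) to the rational x."""
    scale = 2 ** (m + 1)
    return Fraction(round(Fraction(x) * scale), scale)

def is_xi_name_prefix(prefix):
    """Check the xi-name conditions on a finite prefix: the entry at index i
    lies in D_i = Z * 2**-(i+1) and every later entry is within 2**-(i+1) of it."""
    for i, a in enumerate(prefix):
        if (Fraction(a) * 2 ** (i + 1)).denominator != 1:
            return False
        bound = Fraction(1, 2 ** (i + 1))
        if any(abs(a - b) > bound for b in prefix[i + 1:]):
            return False
    return True
```

The prefix `[approx(x, m) for m in range(k)]` passes this check for any rational `x`, since each entry is within $2^{-(m+2)}$ of `x`, whereas the sequence $(-2)^{-n}$ fails already for its first two entries.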




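Definition 7 ([31, Definition 6.2.6]). Let $f : \mathbb{R}^n \to \mathbb{R}$ and $x \in \operatorname{dom} f$. A
function $\gamma : \mathbb{N} \to \mathbb{N}$ is called a (local) modulus of continuity of $f$ at $x$ if for all
$p \in \mathbb{N}$ and $y \in \operatorname{dom} f$, $\|x - y\| \leq 2^{-\gamma(p)} \implies |f(x) - f(y)| \leq 2^{-p}$ holds.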


We note that in most cases a local modulus of continuity of $f$ at $x$ is smaller
than the best uniform modulus of $f$ on its domain, since it only depends on the
local behaviour of $f$ around $x$. One way of computing a local modulus of $f$ at $x$
is using the function-oracle machine $M_f^?$ as defined next.


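Definition 8. Let $M_f^?$ compute $f : \mathbb{R}^n \to \mathbb{R}$ and let $x \in \operatorname{dom} f$ have
Cauchy-name $\varphi$. The function $\gamma_{M_f^?, \varphi} : p \mapsto \max\{0, k : M_f^{\varphi}(p + 2) \text{ queries index } k \text{ of } \varphi\}$
is called the effective local modulus of continuity induced by $M_f^?$ at $\varphi$.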


The effective local modulus of continuity of $f$ at a name $\varphi$ of $x \in \operatorname{dom} f$ indeed is
a local modulus of continuity of $f$ at $x$ [17, Theorem 2.13]. Algorithm 2 computes
ε-full linearisations by means of the effective local modulus [8], as stated next.


Lemma 6. Let $P : f(x) \diamond 0$ be a non-linear constraint in $\mathcal{N}$ and $\alpha$ be an
assignment of $x$ to rationals in $\operatorname{dom} f$. Whenever $C = $ LINEARISELOCAL$_{\delta}$($f$, $x$, $\diamond$, $\alpha$)
and $C \neq$ None, $C$ is an ε-full linearisation of $P$ at $\alpha$, with $\epsilon$ corresponding to
the effective local modulus of continuity induced by $M_f^?$ at a ξ-name of $[\![x]\!]^{\alpha}$.


Thus, the function LINEARISELOCAL$_{\delta}$ in Algorithm 2 is a drop-in replacement
for LINEARISE$_{\delta}$ in Algorithm 1, since the condition on returning a linearisation of
$P$ versus accepting $P_{\delta}$ is identical. The linearisations, however, differ in the radius
$\epsilon$, which now, according to Lemma 6, corresponds to the effective local modulus
of continuity. The resulting procedure we call NLINSTEPLOCAL$_{\delta}$. One of its
advantages over NLINSTEP$_{\delta}$ is that it runs $M_f^?$ on ξ-names instead of Cauchy-names:
ξ-names form a compact set for bounded instances, unlike Cauchy-names. This
allows us to bound $\epsilon > 0$ for the computed ε-full local linearisations of otherwise
arbitrary δ-ksmt runs. A proof of the following lemma, showing compactness of
preimages $\xi^{-1}(X)$ of compact sets $X \subseteq \mathbb{R}$ under $\xi$, is given in [8].


Lemma 7. Let $X \subset \mathbb{R}^n$ be compact. Then the set $\xi^{-1}(X) \subset \mathbb{D}_{\omega}^n$ of ξ-names of
elements in $X$ is compact as well.


The proof involves showing $\xi^{-1}(X)$ to be closed and uses the fact that for each
component $\varphi_k$ of names $(\varphi_k)_k$ of $x \in X$ there are just finitely many choices
from $\mathbb{D}_k$ due to the restriction of the length of the dyadics. This is not the case
for the Cauchy representation used in Definition 1 and it is the key for deriving
existence of a strictly positive lower bound $\epsilon$ on the ε-fullness of linearisations.


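Theorem 4. Let $\delta \in \mathbb{Q}_{>0}$. For any bounded instance $\mathcal{L}_0 \wedge \mathcal{N}$ there is $\epsilon > 0$
such that any δ-ksmt run starting in $(\text{nil}, \mathcal{L}_0, \mathcal{N})$, where applications of (L) and
$(F^{sat}_{\delta})$ are performed according to NLINSTEPLOCAL$_{\delta}$, is ε-full.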


Proof. Assume $\mathcal{L}_0 \wedge \mathcal{N}$ is a bounded instance. Set $\epsilon := \min\{\epsilon_P : P \in \mathcal{N}\}$, where
$\epsilon_P$ is defined as follows. Let $P : f(x) \diamond 0$ in $\mathcal{N}$. Then the closure $\overline{D_P}$ of the bounded
set $D_P$ is compact. Let $E$ be the set of ξ-names of elements of $\overline{D_P} \subseteq \operatorname{dom} f$ (see
Definition 6) and for any $\varphi \in E$ let $k_{\varphi}$ be defined as $\gamma_{M_f^?, \varphi}(p)$ (see Definition 8),
where $p$ is computed from $\delta$ as in Algorithm 2 and is independent of $\varphi$. Since the
preimage of each $k_{\varphi}$ is open, the function $\varphi \mapsto k_{\varphi}$ is continuous. By Lemma 7
the set $E$ is compact, thus there is $\psi \in E$ such that $2^{-k_{\psi}} = \inf\{2^{-k_{\varphi}} : \varphi \in E\}$.
Set $\epsilon_P := 2^{-k_{\psi}}$. The claim then follows by Lemma 6.


Thus we can conclude. 


Corollary 1. δ-ksmt with local linearisations is a δ-complete decision procedure.


7 Conclusion 


In this paper we extended the ksmt calculus to the δ-satisfiability setting
and proved that the resulting δ-ksmt calculus is a δ-complete decision procedure
for solving non-linear constraints over computable functions, which include
polynomials, exponentials, logarithms, trigonometric and many other functions used
in applications. We presented algorithms for constructing ε-full linearisations
ensuring termination of δ-ksmt. Based on methods from computable analysis
we presented an algorithm for constructing local linearisations. Local
linearisations exclude larger regions from the search space and can be used to avoid
computationally expensive global analysis of non-linear functions.


Abstract. The problem of invariant checking in parametric systems,
which are required to operate correctly regardless of the number and
connections of their components, is gaining increasing importance in
various sectors, such as communication protocols and control software.
Such systems are typically modeled using quantified formulae, describ- 
ing the behaviour of an unbounded number of (identical) components, 
and their automatic verification often relies on the use of decidable frag- 
ments of first-order logic in order to effectively deal with the challenges 
of quantified reasoning. 

In this paper, we propose a fully automatic technique for invariant check- 
ing of parametric systems which does not rely on quantified reason- 
ing. Parametric systems are modeled with array-based transition sys- 
tems, and our method iteratively constructs a quantifier-free abstraction 
by analyzing, with SMT-based invariant checking algorithms for non- 
parametric systems, increasingly-larger finite instances of the parametric 
system. Depending on the verification result in the concrete instance, the 
abstraction is automatically refined by leveraging candidate lemmas from
inductive invariants, or by discarding previously computed lemmas. 

We implemented the method using a quantifier-free SMT-based IC3 
as underlying verification engine. Our experimental evaluation demon- 
strates that the approach is competitive with the state of the art, solving 
several benchmarks that are out of reach for other tools. 


Keywords: Parametric Systems · Array-based transition systems ·
Abstraction-refinement · SMT


1 Introduction 


Parametric systems consist of a finite but unbounded number of components. Ex- 
amples include communication protocols (e.g. leader election), feature systems, 
or control algorithms in various application domains (e.g. railways interlocking 
logics). The key challenge is to prove the correctness of the parametric system 
for all possible configurations corresponding to instantiations of the parameters. 
Parametric systems can be described as symbolic array-based transition sys- 
tems [10], where the dependence on the configuration is expressed with first-order 
quantifiers in the initial condition and the transition relation of the model. 

In this paper, we propose a fully automated approach for solving the
universal invariant problem of array-based systems. The distinguishing feature is
that the approach, grounded in SMT, does not require dealing with quantified 
theories, with obvious computational advantages. The algorithm implements an 
abstraction-refinement loop, where the abstract space is a quantifier-free transi- 
tion system over some SMT theories. Our inspiration and starting point is the Pa- 
rameter Abstraction of [3,15], which we extend in two directions. First, we modify 
the definition of the abstraction, by introducing a set of different environment 
variables, which intuitively overapproximate the behaviour of all the instances 
not precisely tracked by the abstraction, and by introducing a special stuttering 
transition in which the environment is allowed to change non-deterministically. 
Second, we combine the abstraction with a method for automatically inferring 
candidate universal lemmas, which are used to strengthen the abstraction in case 
of spurious counterexamples. The candidate lemmas are obtained by generaliza- 
tion from the spuriousness proof carried out in a finite-domain instantiation of 
the concrete system. However, we do not require quantified reasoning to prove 
that they universally hold; rather, the algorithm takes into account the fact that 
candidate lemmas may turn out not to be universally valid. In such cases, the 
method is able to automatically discover such bad lemmas and discard them, by 
examining increasingly-higher-dimension bounded instances of the parametric 
system. 

We implemented the method in a tool called LAMBDA. At its core, LAMBDA
leverages modern model checking approaches for quantifier-free infinite-state
systems, i.e. the SMT-based approach of IC3 with implicit abstraction [4], in
contrast to other approaches [19] where the abstract space is Boolean. In our
experimental evaluation, we compared LAMBDA with the state-of-the-art tools
MCMT [11] and CUBICLE [7]. The results show the advantage of the approach,
which is able to solve multiple benchmarks that are out of reach for its
competitors.

The rest of the paper is structured as follows. In Section 2 we present some 
logical background, and in Section 3 we describe array-based systems. We give 
an informal overview of the algorithm in Section 4. In Section 5 we define the 
abstraction and state its formal properties. In Section 6 we discuss the approach 
to concretization and refinement, and we present the techniques for inferring 
candidate lemmas. We discuss the related work in Section 7, and we present 
our experimental evaluation in Section 8. Finally, in Section 9 we draw some 
conclusions and present directions for future work. For lack of space, the proofs 
of our theoretical results, as well as further details on our experiments, are 
reported in an extended technical report [5].


2 Preliminaries 


Our setting is standard first order logic. A theory $\mathcal{T}$ in the SMT sense is a pair
$\mathcal{T} = (\Sigma, \mathcal{C})$, where $\Sigma$ is a first order signature and $\mathcal{C}$ is a class of models over
$\Sigma$. A theory $\mathcal{T}$ is closed under substructure if its class $\mathcal{C}$ of structures is such
that whenever $M \in \mathcal{C}$ and $N$ is a substructure of $M$, then $N \in \mathcal{C}$. We use the
standard notions of Tarskian interpretation (assignment, model, satisfiability,
validity, logical consequence). We refer to 0-arity predicates as Boolean variables,
and to 0-arity uninterpreted functions as (theory) variables. A literal is an atom
or its negation. A clause is a disjunction of literals. A formula is in conjunctive
normal form (CNF) iff it is a conjunction of clauses. If $x_1, \ldots, x_n$ are variables
and $\phi$ is a formula, we might write $\phi(x_1, \ldots, x_n)$ to indicate that all the variables
occurring free in $\phi$ are in $x_1, \ldots, x_n$.

If $\phi$ is a formula, $t$ is a term and $v$ is a variable which occurs free in $\phi$, we write
$\phi[v/t]$ for the substitution of every occurrence of $v$ with $t$. If $t$ and $v$ are vectors
of the same length, we write $\phi[v/t]$ for the simultaneous substitution of each $v_i$
with the corresponding term $t_i$. We use an if-then-else notation for formulae.
We write if $\phi_1$ then $\psi_1$ elif $\phi_2$ then $\psi_2$ elif $\ldots$ $\psi_{n-1}$ else $\psi_n$ to denote the
formula $(\phi_1 \to \psi_1) \wedge ((\neg\phi_1 \wedge \phi_2) \to \psi_2) \wedge \ldots \wedge ((\neg\phi_1 \wedge \ldots \wedge \neg\phi_{n-1}) \to \psi_n)$.
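The expansion of the if-then-else chain can be checked mechanically on Boolean values. The Python sketch below is only illustrative and its function names are hypothetical. It compares the conjunction of guarded implications with the usual "first guard that holds wins" reading, for `n - 1` guards and `n` branch values.

```python
from itertools import product

def ite_value(guards, branches):
    """Sequential reading: the branch of the first true guard, else the last branch."""
    for g, b in zip(guards, branches):
        if g:
            return b
    return branches[-1]

def ite_formula(guards, branches):
    """Conjunction of implications, as in the expansion of the notation."""
    result = True
    none_before = True      # conjunction of the negated earlier guards
    for g, b in zip(guards, branches):
        result = result and (not (none_before and g) or b)
        none_before = none_before and not g
    # final 'else' conjunct: all guards false implies the last branch
    return result and (not none_before or branches[-1])
```

Exactly one antecedent is true under any assignment, so both functions agree on all inputs; the test below enumerates every assignment for two guards and three branches.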

Given a set of variables $v$, we denote with $v'$ the set $\{v' \mid v \in v\}$. A symbolic 
transition system is a triple $(v, I(v), T(v, v'))$, where $v$ is a set of variables, and 
$I(v)$, $T(v, v')$ are first order formulae over some signature. An assignment to 
the variables in $v$ is a state. A state $s$ is initial iff it is a model of $I(v)$, i.e. 
$s \models I(v)$. The states $s, s'$ denote a transition iff $s \cup s' \models T(v, v')$, also written 
$T(s, s')$. A path is a sequence of states $s_0, s_1, \ldots$ such that $s_0$ is initial and 
$T(s_i, s_{i+1})$ for all $i$. We denote paths with $\pi$, and with $\pi[j]$ the $j$-th element of 
$\pi$. A state $s$ is reachable iff there exists a path $\pi$ such that $\pi[i] = s$ for some $i$. A 
variable $v$ is frozen iff for all $\pi, i$ it holds that $\pi[i](v) = \pi[0](v)$. In the following, 
when we define a frozen variable $v$, we assume that this is done by having a 
constraint $v' = v$ as a top-level conjunct of the transition formula. A formula 
$\phi(v)$ is an invariant of the transition system $C = (v, I(v), T(v, v'))$ iff it holds 
in all the reachable states. Following the standard model checking notation, we 
denote this with $C \models \phi(v)$.¹ A formula $\phi(v)$ is an inductive invariant for $C$ iff 


$$I(v) \models \phi(v) \quad \text{and} \quad \phi(v) \wedge T(v, v') \models \phi(v').$$
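The difference between an invariant and an inductive invariant can be seen on a small explicit-state system. The sketch below is only an illustration under our own assumptions (a toy counter, states enumerated explicitly rather than symbolically); none of the names come from the paper.

```python
def reachable(init, step):
    """All states reachable from `init`; `step(s)` returns the successors of s."""
    seen, frontier = set(init), list(init)
    while frontier:
        state = frontier.pop()
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def is_invariant(init, step, phi):
    """phi holds in every reachable state."""
    return all(phi(s) for s in reachable(init, step))


def is_inductive_invariant(states, init, step, phi):
    """Initiation (I |= phi) and consecution (phi & T |= phi') over all states."""
    initiation = all(phi(s) for s in init)
    consecution = all(phi(t) for s in states if phi(s) for t in step(s))
    return initiation and consecution


# Toy system: a counter over 0..7 that starts at 0 and moves x -> x + 2 (mod 8).
STATES = range(8)
INIT = {0}


def step(x):
    return [(x + 2) % 8]


def not_seven(x):      # an invariant that is not inductive (5 -> 7)
    return x != 7


def is_even(x):        # an inductive invariant that implies not_seven
    return x % 2 == 0
```

Here `not_seven` holds in all reachable states but fails consecution from the unreachable state 5, while the stronger `is_even` is inductive.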


3 Modeling Parametric Systems as Array-based 
Transition Systems 


In order to describe parametric systems, we adapt from [10] the notion of array-based 
systems. In the following, we fix a theory of indexes $\mathcal{T}_I = (\Sigma_I, \mathcal{C}_I)$ and a 
theory of elements $\mathcal{T}_E = (\Sigma_E, \mathcal{C}_E)$. In order to model the parameters, we require 
that the class $\mathcal{C}_I$ is closed under substructure. Then with $\mathcal{A}^E_I$ we denote the 
theory whose signature is $\Sigma = \Sigma_I \cup \Sigma_E \cup \{\cdot[\cdot]\}$, and a model for it is given by 
a set of total functions from a model of $\mathcal{T}_I$ to a model of $\mathcal{T}_E$. In general, we 
can have several array theories with multiple sorts for indexes and elements. 


1 Note that we use the symbol $\models$ with three different denotations: if $\phi, \psi$ are formulae, 
$\phi \models \psi$ denotes that $\psi$ is a logical consequence of $\phi$; if $\mu$ is an interpretation, and 
$\psi$ is a formula, $\mu \models \psi$ denotes that $\mu$ is a model of $\psi$; if $C$ is a transition system, 
$C \models \psi$ denotes that $\psi$ is an invariant of $C$. 
For simplicity, we fix only an index sort and an elem sort. In the following, an 
array-based transition system 


$$C = (a, \iota(a), \tau(a, a'))$$

is a symbolic transition system, with the additional constraints that: 


- $a$ is a variable of sort index $\mapsto$ elem. We use a single variable for the sake 
of simplicity: additional variables of arbitrary type (also of index or element 
type) can be added without loss of generality. 

- $\iota(a)$ is a first-order formula of the form $\forall i.\phi(i, a[i])$, where $i$ is of index sort 
and $\phi$ is a quantifier-free formula. 

- $\tau(a, a')$ is a finite disjunction of formulae, $\bigvee_{k=1}^{n} \tau_k$, such that every $\tau_k$ is a 
formula of the following type (with $i, j$ of index sort): 


$$\exists i\, \forall j.\ \psi(i, j, a[i], a[j], a'[i], a'[j])$$


with $\psi$ a quantifier-free formula. 


This syntactic requirement subsumes the common guard and update formalism 
used for the description of parametric systems, e.g. in [10,12,15]. 

In the following, we shall refer to the disjuncts $\tau_k$ of $\tau$ as transition rules (or 
simply rules when clear from the context). 


An array-based transition system can be seen as a family of transition systems, 
one for each cardinality of the finite models $\mathcal{M}_I$ of $\mathcal{T}_I$. In the following, 
given an integer $d$, we denote with $C^d$ the finite instance of $C$ of size $d$ obtained 
by instantiating the quantifiers of $C$ over a set of fresh index variables of 
cardinality $d$ (considered implicitly different from each other). Note that this $C^d$ 
is a symmetric presentation [15]: if $c = \{c_1, \ldots, c_d\}$ are the fresh index variables, 
and $\sigma$ is a permutation of $c$, we have that, for every formula $\phi(c, a[c])$, 


$$C^d \models \phi(c, a[c]) \iff C^d \models \phi(\sigma(c), a[\sigma(c)]).$$


Example 1 (Mutex Protocol for Ring Topology). Here we describe a simple protocol 
for accessing a shared resource, with processes in a ring-shaped topology. 
As an index theory, we use the finite sets of integers. As an element theory, 
we use both the Booleans and an enumerated data type of two elements, 
namely {idle, critical}. The array variable $t$, with sort index $\mapsto$ boolean, is 
true at an index $x$ if $x$ holds the token. The variable $s$, with sort 
index $\mapsto$ {idle, critical}, holds the current state of each process. In addition, 
we have an integer frozen variable length, which represents the length of the 
ring. The transition system is described by the following formulae: 


Initial states. Initially, only one process holds the token, and every process is 
idle. We model this initial process with an additional constant init_token 
of sort index. Moreover, each index is bounded by the value of length. The 
initial formula is: 


$$\forall j.\ s[j] = \mathit{idle} \wedge j \geq 1 \wedge j \leq \mathit{length} \wedge \mathit{length} > 0\ \wedge$$
$$\quad (\text{if } j = \mathit{init\_token} \text{ then } t[j] = \mathit{true} \text{ else } t[j] = \mathit{false})$$

Transition rule 1. A process which holds the token can enter the critical section: 


$$\exists i.\ s[i] = \mathit{idle} \wedge t[i] = \mathit{true} \wedge s'[i] = \mathit{critical} \wedge t'[i] = t[i]\ \wedge$$
$$\quad \forall j, j \neq i.\ (s'[j] = s[j] \wedge t'[j] = t[j])$$

Transition rule 2. A process exits from the critical section and passes the 
token to the process at its right: 


$$\exists i.\ t[i] = \mathit{true} \wedge s[i] = \mathit{critical} \wedge s'[i] = \mathit{idle} \wedge t'[i] = \mathit{false}\ \wedge$$
$$\quad \forall j, j \neq i.\ \left(\begin{array}{l}
\text{if } j = 1 \wedge i = \mathit{length} \text{ then } s'[j] = s[j] \wedge t'[j] = \mathit{true} \\
\text{elif } j = i + 1 \wedge i < \mathit{length} \text{ then } s'[j] = s[j] \wedge t'[j] = \mathit{true} \\
\text{else } s'[j] = s[j] \wedge t'[j] = t[j]
\end{array}\right)$$
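To make the protocol concrete, the following Python sketch enumerates the reachable states of a finite instance with $d$ processes and checks mutual exclusion. It is an explicit-state illustration written for this text, not the symbolic encoding used by the method; it assumes indices $1, \ldots, d$ with length $= d$, and that rule 2 requires the exiting process to hold the token.

```python
def initial_states(d):
    """One initial state per choice of init_token in 1..d: all idle, a single token."""
    for token in range(1, d + 1):
        s = tuple("idle" for _ in range(d))
        t = tuple(j == token for j in range(1, d + 1))
        yield (s, t)


def successors(state, d):
    """Apply transition rules 1 and 2 for every process index i in 1..d."""
    s, t = state
    for i in range(1, d + 1):
        k = i - 1  # 0-based position of process i
        if s[k] == "idle" and t[k]:
            # Rule 1: the token holder enters the critical section.
            yield (s[:k] + ("critical",) + s[k + 1:], t)
        if s[k] == "critical" and t[k]:
            # Rule 2: leave the critical section and pass the token to the right.
            right = 1 if i == d else i + 1
            s2, t2 = list(s), list(t)
            s2[k], t2[k] = "idle", False
            if right != i:  # the rule only updates indices j != i
                t2[right - 1] = True
            yield (tuple(s2), tuple(t2))


def reachable_states(d):
    seen = set(initial_states(d))
    frontier = list(seen)
    while frontier:
        state = frontier.pop()
        for nxt in successors(state, d):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def mutual_exclusion_holds(d):
    """No reachable state has two processes in the critical section."""
    return all(sum(1 for x in s if x == "critical") <= 1
               for s, _ in reachable_states(d))
```

For example, `mutual_exclusion_holds(4)` explores the instance of size 4; such bounded checks say nothing about larger rings, which is what the parametric method addresses.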


3.1 Universal invariant problem for array-based systems 


In the following, given an array-based transition system 
$C = (a, \iota(a), \tau(a, a'))$, 


the universal invariant problem is the problem of proving (or disproving) that a 
formula of the form $\Phi = \forall i.\phi(i, a[i])$ is an invariant for $C$. 


Guard Strengthening. In order to prove that $\forall i.\phi(i, a[i])$ is an invariant of a 
system $C = (a, \iota(a), \tau(a, a'))$, we can first strengthen the rules of $C$ by adding the 
candidate invariant in conjunction with the transition relation, and then prove 
that the formula is an invariant of the newly-restricted system. This induction 
principle is justified by the following proposition: 


Proposition 1 (Guard strengthening [15]) Let $C = (a, \iota(a), \tau(a, a'))$ be a 
transition system and let $\Phi$ be $\forall i.\phi(i, a[i])$. Let $C_\Phi = (a, \iota(a), \tau(a, a') \wedge \Phi)$ be the 
guard-strengthening of $C$ with respect to $\Phi$. Then, if $\Phi$ is an invariant of $C_\Phi$, it 
is also an invariant of $C$. 
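The effect of guard strengthening can be illustrated on an explicit-state toy system. In the sketch below (our own example, with hypothetical names), the strengthened system only fires transitions from states satisfying the candidate invariant; checking the candidate on the strengthened system gives the same verdict as checking it on the original one.

```python
def reachable(init, step):
    """All states reachable from `init`; `step(s)` returns the successors of s."""
    seen, frontier = set(init), list(init)
    while frontier:
        state = frontier.pop()
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def is_invariant(init, step, phi):
    return all(phi(s) for s in reachable(init, step))


def strengthen(step, phi):
    """Guard-strengthened successor function: tau & phi fires only from phi-states."""
    def step_phi(state):
        return step(state) if phi(state) else []
    return step_phi


# Toy system: a counter over 0..7 that starts at 0 and moves x -> x + 2 (mod 8).
INIT = {0}


def step(x):
    return [(x + 2) % 8]


def never_odd(x):       # holds in the original system
    return x % 2 == 0


def never_four(x):      # violated in the original system (0 -> 2 -> 4)
    return x != 4
```

A violation is not hidden by the strengthening: the first bad state on a path is always entered from a state that still satisfies the candidate.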


Prophecy variables. The universal quantifiers in the candidate invariant can 
be replaced with fresh frozen variables, called prophecy variables, that intuitively 
contain the indexes of the processes witnessing the violation of the property. 


Proposition 2 (Removing quantifiers [19]) Let $C = (a, \iota(a), \tau(a, a'))$ be an 
array-based system. The formula $\forall i.\phi(i, a[i])$ is an invariant for $C$ iff the formula 
$\phi(p, a[p])$ is an invariant for $C_{+p} = (a \cup p, \iota(a), \tau(a, a'))$, where $p$ is a set of fresh 
frozen variables of index sort. 
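On a finite instance, the idea behind prophecy variables can be checked by brute force. The sketch below uses a toy array system of our own (not one from the paper): it compares the quantified check of $\forall i, j.\ \neg(a[i] \wedge a[j])$ over distinct indices with the quantifier-free check of $\neg(a[p_1] \wedge a[p_2])$ in a system whose states carry two frozen, distinct indices.

```python
from itertools import permutations


def reachable(init, step):
    seen, frontier = set(init), list(init)
    while frontier:
        state = frontier.pop()
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def lock_step(a):
    """Toy rules: a cell may become True only if no cell is True; a True cell may reset."""
    for i, held in enumerate(a):
        if held:
            yield a[:i] + (False,) + a[i + 1:]
        elif not any(a):
            yield a[:i] + (True,) + a[i + 1:]


def buggy_step(a):
    """Unguarded variant: any cell may flip, so two cells can be True at once."""
    for i, held in enumerate(a):
        yield a[:i] + (not held,) + a[i + 1:]


def universal_invariant(d, step):
    """Quantified check: for all distinct i, j, not (a[i] and a[j])."""
    init = {(False,) * d}
    return all(not (a[i] and a[j])
               for a in reachable(init, step)
               for i, j in permutations(range(d), 2))


def prophecy_invariant(d, step):
    """Quantifier-free check with frozen prophecy indices p = (p1, p2), p1 != p2."""
    init = {((False,) * d, p) for p in permutations(range(d), 2)}

    def step_with_prophecy(state):
        a, p = state
        return ((a2, p) for a2 in step(a))  # p is frozen: copied unchanged

    return all(not (a[p[0]] and a[p[1]])
               for a, p in reachable(init, step_with_prophecy))
```

Both checks agree on the correct and on the buggy rule set; the prophecy version pays for the removed quantifier with one copy of the state space per choice of $p$.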


For better readability, in the following we will omit the subscript +p. More- 
over, we assume that the index variables universally quantified in the candidate 
invariant are considered to be different. This does not limit expressiveness, and 
simplifies our discourse. Therefore, the prophecy variables induced by a candi- 
date invariant are considered to be implicitly different. 
[Figure 1: flowchart of the algorithm. Recoverable box labels: Guard Strengthening; 
Prophecies; Parameter Abstraction Computation; Refinement by Generalization; 
Compute lemma from inductive invariant $I^d$; Find and fix bad lemmas.] 


Fig. 1. An overview of the algorithm. $C$ is an array-based transition system; $\Phi$ is a 
quantified candidate invariant; $\Psi = \{\psi_1, \ldots, \psi_n\}$ is the set of candidate lemmas; $C_{\Phi,\Psi}$ 
is a quantified transition system resulting from the strengthening of $C$; $\tilde{C}_{\Phi,\Psi}$ is a 
quantifier-free transition system. 


4 Overview of the Method 


In the following, let an array-based transition system $C = (a, \iota(a), \tau(a, a'))$, and 
a candidate universal invariant $\Phi = \forall i.\phi(i, a[i])$ for $C$ be given. 

We now summarize the algorithm that attempts to solve the universal invariant 
problem for $C$ and $\Phi$. The algorithm, depicted in Figure 1, iterates trying 
either to construct an abstraction sufficiently precise to prove the property (exit 
with SAFE), or to find a finite instantiation of the problem exhibiting a concrete 
counterexample (exit with UNSAFE). The abstract space is quantifier-free, and is 
obtained by instantiating the universally quantified formulae over two sets of 
index variables: the prophecy variables, which arise from the candidate invariant 
(as explained in Proposition 2), and are denoted with $p$; and the environmental 
variables, denoted with $x$, which arise from the transition formula and are 
intended to represent the environment surrounding the $p$ indexes, interacting 
with them in the behaviour leading to the violation. While prophecy variables 
are frozen, thus representing the same indexes for the whole run, environmental 
variables are free to change at each time step, hence producing possibly spurious 
behaviours. The algorithm maintains a set of candidate lemmas $\Psi = \{\psi_j\}_j$, 
composed of universally quantified formulae, that are used to strengthen the 
property and to tighten the abstraction. Initially, $\Psi$ is empty. In the following, if 
$C^d$ is a finite instance of $C$ and $\Phi$ is a candidate universal invariant, with $\Phi^d$ we 
denote the formula obtained from $\Phi$ by instantiating the quantifiers over the 
variables used for the domain of cardinality $d$. 
At each iteration, we carry out the following high-level steps (described in 
detail in the next sections): 


1. the property $\Phi$ to be proved is conjoined with the candidate lemmas in $\Psi$, 
   and its quantifiers are moved in prenex form;² 

2. we construct the guard-strengthening $C_{\Phi,\Psi}$ (cf. Proposition 1), conjoining 
   $\Phi \wedge \Psi$ to the transition rules of $C$; 

3. we compute our modified Parameter Abstraction of $C_{\Phi,\Psi}$ (defined in §5.1). 
   First, we define the necessary prophecy variables $p$ and environmental 
   variables $x$. Then, we instantiate the quantifiers obtaining the quantifier-free 
   array transition system $\tilde{C}_{\Phi,\Psi}$; 

4. we (try to) solve the invariant checking problem $\tilde{C}_{\Phi,\Psi} \models \tilde{\Phi} \wedge \tilde{\Psi}$ by calling a 
   model checker for quantifier-free transition systems. $\tilde{\Phi} \wedge \tilde{\Psi}$ is obtained from 
   $\Phi \wedge \Psi$ by removing quantifiers with prophecy variables, as in Proposition 2; 

5. if the model checker concludes that there is no violation, then $\Phi$ holds in $C$ 
   (by the properties of the Parameter Abstraction), and we exit with SAFE; 

6. otherwise, we try to check whether the property violation in the abstract 
   space corresponds to a real counterexample. We do so by checking whether 
   the current property $\Phi \wedge \Psi$ is falsified in $C^d$, a suitable finite instance of $C$. 
   That is, we check whether $C^d \models (\Phi \wedge \Psi)^d$; 

7. if $C^d \models (\Psi \wedge \Phi)^d$, then the abstraction must be tightened. When the 
   verification of the finite instance succeeds, an inductive invariant $I^d$ is produced, 
   which is used to compute (candidate) lemmas by generalization from $d$ to 
   the universal case; 

8. if $C^d \not\models (\Psi \wedge \Phi)^d$, two cases are possible. First, we check if the (instantiation 
   of the) property $\Phi$ is indeed violated. If so, we exit with UNSAFE, and we 
   produce a concrete counterexample to the original problem, finitely witnessed 
   in $C^d$. 

   However, it is also possible that $C^d$ does not violate $\Phi$, but it falsifies some 
   lemmas. In fact, the candidate lemmas obtained at previous iterations, by 
   generalization on $C^{d'}$ with $d' \neq d$, may not hold universally in $C$. In that 
   case, the bad lemmas must be fixed, and the iteration is restarted. 


When the algorithm terminates with UNSAFE, we are able to exhibit a finite 
counterexample trace in a finite instance of $C$ violating the property. When 
the algorithm terminates with SAFE, then the property holds in $C$. The result 
is obtained by the following chain of implications: from Theorem 3, stated in 
the next section, we have that $\tilde{C}_{\Phi,\Psi} \models \tilde{\Phi} \wedge \tilde{\Psi}$ implies $C_{\Phi,\Psi} \models \tilde{\Phi} \wedge \tilde{\Psi}$. From 
Proposition 2, we have that $C_{\Phi,\Psi} \models \Phi \wedge \Psi$. Therefore, from Proposition 1, we 
have $C \models \Phi \wedge \Psi$. In particular, we have $C \models \Phi$. 
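The control flow of the loop just described can be sketched as follows. This is only a skeleton under our own assumptions: every callback is a hypothetical oracle standing in for a component of the method (abstract model checker, bounded-instance model checker, generalization), and the iteration bound `max_iters` is an addition of this sketch, not part of the algorithm.

```python
def invariant_check_loop(abstract_proves, check_instance, violates_property,
                         generalize, falsified, initial_size, max_iters=100):
    """Skeleton of the abstraction/refinement loop; all callbacks are hypothetical.

    abstract_proves(lemmas)    -> bool: the abstract system satisfies property and lemmas
    check_instance(d, lemmas)  -> ("holds", inductive_invariant) or ("fails", trace)
    violates_property(trace)   -> bool: the trace falsifies the original property
    generalize(invariant)      -> iterable of candidate lemmas
    falsified(trace, lemmas)   -> iterable of lemmas falsified by the trace
    """
    lemmas, d = frozenset(), initial_size
    for _ in range(max_iters):
        if abstract_proves(lemmas):
            return "SAFE", lemmas
        verdict, witness = check_instance(d, lemmas)
        if verdict == "holds":
            # Tighten the abstraction with lemmas generalized from I^d, then grow d.
            lemmas = lemmas | frozenset(generalize(witness))
            d += 1
        elif violates_property(witness):
            return "UNSAFE", witness
        else:
            # Some candidate lemmas were wrong: drop them and retry with the same d.
            lemmas = lemmas - frozenset(falsified(witness, lemmas))
    return "UNKNOWN", lemmas


# Toy oracles: the abstraction succeeds once the lemma "L" is known, and the
# bounded instance yields an invariant that generalizes to "L".
result = invariant_check_loop(
    abstract_proves=lambda lemmas: "L" in lemmas,
    check_instance=lambda d, lemmas: ("holds", {"L"}),
    violates_property=lambda trace: False,
    generalize=lambda invariant: invariant,
    falsified=lambda trace, lemmas: set(),
    initial_size=2,
)
```

With these toy oracles the loop needs one refinement round before the abstract check succeeds.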


5 Modified Parameter Abstraction 


We describe here our Parameter Abstraction. The first version of this approach 
was introduced in [3], and later formalized in [15]. In the following, we describe 


2 In the following, with $\Phi \wedge \Psi$ we denote the prenex form of $\Phi \wedge \bigwedge_i \psi_i$. 


138 Alessandro Cimatti, Alberto Griggio, and Gianluca Redondi 


a novel version of the abstraction, and how it can be applied to array-based 
transition systems. The main novelty is that, instead of using a special abstract 
index “$\ast$” that overapproximates the behaviour of the system in the array 
tions that are not explicitly tracked, we use n environmental (index) variables 
which are not abstracted, but are allowed to change nondeterministically in some 
transitions. This can be achieved by the usage of an additional stuttering tran- 
sition: this rule allows the environmental variables to change value arbitrarily, 
while not changing the values of the array in the prophecies. 


5.1 Abstraction Computation 


Let an array-based transition system $C$ and a universal invariant $\Phi$ be given.³ 


By conjoining $\Phi$ to the transition rules in $C$, we obtain $C_\Phi$, the guard strengthening 
of $C$ with respect to $\Phi$. Then, we define two sets of variables: the prophecy 
variables $p$, in number determined by Proposition 2, and the environmental 
variables $x$, in number determined by the greatest existential quantification depth 
in the transition rules of $C_\Phi$. While the prophecies are frozen variables, the 
interpretation of the environmental variables is not fixed. Moreover, we assume 
that the values taken by $p$ and $x$ are different. We now define $\tilde{C}_\Phi$, the parameter 
abstraction of $C_\Phi$. 


Initial formula. Let $\iota(a)$ be $\forall i.\phi(i, a[i])$, the initial formula of $C$ in prenex 
form, with $\phi(i, a[i])$ quantifier-free. The initial formula of the abstract system is 
a quantifier-free first order formula, denoted $\tilde{\iota}(p, a[p])$, obtained by instantiating 
all the universal quantifiers in $\iota$ over the set of prophecy variables $p$. 


Transition formula. The transition formula of $C_\Phi$ is still represented by a 
disjunction of formulae of the form⁴ 


$$\tau(a, a') \stackrel{\text{def}}{=} \exists i\, \forall j.\ \psi(i, j, a[i], a[j], a'[i], a'[j]).$$


For simplicity, we can assume that we have only one rule $\tau(a, a')$. First, we 
compute the set of all substitutions of the $i$ over $p \cup x$, and we consider the set 
of formulae $\{\tau_j(p, x, a, a')\}$, where $j$ ranges over the substitutions, and $\tau_j$ is the 
result of applying the substitution to $\tau$. 

Then, for each formula in the set $\{\tau_j\}$, we instantiate the universal quantifiers 
over the set $p \cup x$, obtaining a quantifier-free formula over prophecy and 
environmental variables. 

Moreover, we consider an additional transition formula, called the stuttering 
transition, defined by: 


$$\tilde{\tau}_S \stackrel{\text{def}}{=} \bigwedge_{p \in p} a'[p] = a[p]\ \wedge\ p' = p$$


3 These represent the system and the property in input to each iteration of the loop. 
4 Possibly by performing trivial logical manipulations to distribute the guard 
strengthening inside the rules. 
The disjunction of all the abstracted transition formulae is the transition 
formula $\tilde{\tau}$. So, we can now define the transition system 


$$\tilde{C} = (\{a, p, x\},\ \tilde{\iota}(p, a[p]),\ \tilde{\tau}(p, x, a[p \cup x], a'[p \cup x])).$$


Example 2. We apply the abstraction procedure to the transition rule 2 of the 
token ring protocol of Example 1. 

Since the invariant is the formula $\forall i, j.\ \neg(s[i] = \mathit{critical} \wedge s[j] = \mathit{critical})$, it 
follows that we have two prophecy variables $p_1, p_2$. Recall that the invariant itself is 
added to the transition as an additional conjunct. Since the existential 
quantification depth is one, we have only one environment variable $x_1$. In the abstract 
system we obtain three transition formulae from the original transition; we 
report the one indexed by the substitution mapping $i$ into $x_1$; such a formula is 
equivalent to the following: 


$$s[x_1] = \mathit{critical} \wedge t[x_1] = \mathit{true} \wedge s'[x_1] = \mathit{idle} \wedge t'[x_1] = \mathit{false}\ \wedge$$
$$\bigwedge_{j \in \{p_1, p_2\}} \left(\begin{array}{l}
\text{if } j = 1 \wedge x_1 = \mathit{length} \text{ then } s'[j] = s[j] \wedge t'[j] = \mathit{true} \\
\text{elif } j = x_1 + 1 \wedge x_1 < \mathit{length} \text{ then } s'[j] = s[j] \wedge t'[j] = \mathit{true} \\
\text{else } s'[j] = s[j] \wedge t'[j] = t[j]
\end{array}\right)\ \wedge$$
$$\bigwedge_{i, j \in \{p_1, p_2, x_1\},\ i \neq j} \neg(s[i] = \mathit{critical} \wedge s[j] = \mathit{critical})$$
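The enumeration step of the abstraction (substituting the existential index variables over $p \cup x$, then instantiating the universal ones) can be sketched with a few lines of Python. The representation of variables as strings and the helper names are assumptions of this sketch; it reproduces the counts of Example 2, with one existential variable, two prophecy variables and one environmental variable.

```python
from itertools import product


def substitutions(existential_vars, prophecy_vars, environment_vars):
    """Yield every map from the existential index variables to p ∪ x."""
    targets = list(prophecy_vars) + list(environment_vars)
    for choice in product(targets, repeat=len(existential_vars)):
        yield dict(zip(existential_vars, choice))


def universal_instances(universal_vars, targets, keep=lambda assignment: True):
    """Yield the instantiations of the universal variables over `targets` accepted by `keep`."""
    for choice in product(targets, repeat=len(universal_vars)):
        assignment = dict(zip(universal_vars, choice))
        if keep(assignment):
            yield assignment


prophecy, environment = ["p1", "p2"], ["x1"]

# One abstract transition formula per substitution of i over {p1, p2, x1}.
subs = list(substitutions(["i"], prophecy, environment))

# For the substitution i -> x1, the guard j != i leaves j ranging over {p1, p2}.
js = list(universal_instances(["j"], prophecy + environment,
                              keep=lambda asg: asg["j"] != "x1"))
```

The `keep` filter relies on the assumption that distinct index variables take distinct values, so the instance $j = x_1$ is ruled out by the guard $j \neq i$.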


5.2 Stuttering Simulation 


We define here the stuttering simulation induced by our version of the Parameter 
Abstraction. The proof of the main theorem can be found in the appendix. The 
stuttering is induced by $\tilde{\tau}_S$: this is a weaker version than the simulation induced 
by [15], yet it is sufficient for preserving invariants. 


Definition 1 (Stuttering simulation) Given two symbolic transition systems 
$C_1 = (x_1, \iota_1, \tau_1)$ and $C_2 = (x_2, \iota_2, \tau_2)$, with sets of states $S_1$ and $S_2$, a stuttering 
simulation $S$ is a relation $S \subseteq S_1 \times S_2$, such that: 


- for every $s_1 \in S_1$ such that $s_1 \models \iota_1$, there exists some $s_2 \in S_2$ such that 
$(s_1, s_2) \in S$ and $s_2 \models \iota_2$; 

- for every $(s_1, s_2) \in S$, and for every $s_1' \in S_1$ such that $s_1 \cup s_1' \models \tau_1$, there 
exists either some $s_2' \in S_2$ such that $(s_1', s_2') \in S$ and $s_2 \cup s_2' \models \tau_2$, or some 
$(s_2', s_2'') \in S_2 \times S_2$ such that $(s_1', s_2'') \in S$, and $s_2 \cup s_2' \models \tau_2$, $s_2' \cup s_2'' \models \tau_2$. 


If such a relation exists, we say that $C_2$ stutter simulates $C_1$. 
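For explicit finite systems, the two conditions of Definition 1 can be checked directly. The sketch below is an illustration under our own encoding (initial states as sets, transition relations as sets of pairs); it is not how the simulation is established for the Parameter Abstraction, which is done by proof.

```python
def is_stuttering_simulation(S, init1, trans1, init2, trans2):
    """Check Definition 1 for explicit finite systems.

    S: set of pairs (s1, s2); init1, init2: sets of initial states;
    trans1, trans2: sets of (state, successor) pairs.
    """
    # Condition 1: every initial state of C1 is related to some initial state of C2.
    for s1 in init1:
        if not any((s1, s2) in S for s2 in init2):
            return False

    succ2 = {}
    for source, target in trans2:
        succ2.setdefault(source, set()).add(target)

    # Condition 2: every C1 step is matched by one or two C2 steps.
    for s1, s2 in S:
        for source, s1_next in trans1:
            if source != s1:
                continue
            one_step = any((s1_next, a) in S for a in succ2.get(s2, ()))
            two_steps = any((s1_next, b) in S
                            for a in succ2.get(s2, ())
                            for b in succ2.get(a, ()))
            if not (one_step or two_steps):
                return False
    return True


# Toy systems: C1 takes one step 0 -> 1; C2 needs an intermediate state.
INIT1, TRANS1 = {0}, {(0, 1)}
INIT2, TRANS2 = {"a"}, {("a", "m"), ("m", "b")}
GOOD = {(0, "a"), (1, "b")}
```

With `GOOD`, the single step of the first system is matched by the two-step path `a -> m -> b` of the second, which is exactly the extra freedom the stuttering case provides.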


We write $S(s_1)$ for $\{s_2 \mid (s_1, s_2) \in S\}$. We recall that stutter simulation 
preserves reachability, i.e. if $C_2$ stutter simulates $C_1$, then if $s_1$ is reachable in $C_1$ 
then the set $S(s_1)$ is reachable in $C_2$. Formally, the stuttering simulation induced 
by the Parameter Abstraction is defined as follows. 
Definition 2 (Simulation) Let $C$ be the original transition system and let $\tilde{C}$ 
be its Parameter Abstraction. Let $s$ and $\tilde{s}$ denote states of $C$ and $\tilde{C}$, respectively. 
We define $S$ as follows: 


$$S(s, \tilde{s}) \text{ iff } s(a)[i] = \tilde{s}(a)[i] \text{ for all } i \in \bigcup_{p \in p} \tilde{s}(p).$$


Intuitively, we require that in the concrete state $s$ and the abstract state $\tilde{s}$, 
the array is interpreted in the same way for all the locations referred by the 
prophecy variables. We then have the following: 


Theorem 3. The relation $S$ is a stuttering simulation between $C$ and $\tilde{C}$. Moreover, 
if $\tilde{C} \models \tilde{\Phi}(p, a[p])$, then $C \models \tilde{\Phi}(p, a[p])$. 


6 Refinement 


If $\tilde{\Phi}(p, a[p])$ does not hold in $\tilde{C}$, in general we cannot conclude anything, since the 
abstraction could be too coarse. So, if an abstract counterexample is encountered, 
we try to explore a small instance of the system to see if this counterexample 
occurs in it. To choose the appropriate size, our algorithm keeps a counter $d$, 
whose value is equal to the size to explore. Initially, $d$ is equal to the number 
of (universally-quantified) index variables in the property $\Phi$.⁵ When an abstract 
counterexample is encountered, we check whether $C^d \models (\Phi \wedge \Psi)^d$. For this check, 
we use a model checker able to return, in case of success, an inductive invariant 
$I^d$. From the inductive invariant we compute some first order formulae $J$ which 
will be a new set of candidate lemmas. We will see later how to obtain this 
generalization. After computing the new lemmas, we set $d = d + 1$. If a concrete 
counterexample is found, then there are two cases: (i) the counterexample 
falsifies the original property, and we exit from the algorithm with a concrete 
counterexample; (ii) the counterexample falsifies some lemmas; in this case we 
remove the falsified lemmas and restart the loop (without changing $d$). 


6.1 From Invariants to Universal Lemmas 


Definition 3 Let $d$ be an integer, and let $I^d$ be a set of clauses containing 
$d$ variables. A generalization of $I^d$ is a first-order formula $J$ such that, when 
evaluating the quantifiers in $J$ in a domain with precisely $d$ elements, we obtain 
a formula equivalent to $I^d$. 


We use the following technique for generalization. Suppose that $I^d$ is in CNF, 
and that we used $c_1, \ldots, c_d$ as variables for an instance with $d$ elements. Then, 
$I^d = C_1 \wedge \cdots \wedge C_n$ is a conjunction of clauses. From each of those clauses we 


5 Recall that we assume that quantified index variables are required to be different. 
Therefore, the property holds vacuously on instances of size smaller than the number 
of index variables in ®. 
will obtain a new candidate lemma. Let $\mathit{AllDiff}(i)$ be the formula which states 
that all variables in $i$ are different from each other. Since every $C^d$ is given by 
a symmetric presentation [15], we have that, for every $i \in \{1, \ldots, n\}$, 
$C^d \models \forall i_1, \ldots, i_h.\ \mathit{AllDiff}(i_1, \ldots, i_h) \to C_i(i_1, \ldots, i_h)$, where the quantifiers range over 
$c_1, \ldots, c_d$ and $h \leq d$ is the number of variables which occur in $C_i$. This means 
that $J \stackrel{\text{def}}{=} \bigwedge_i \forall i.\ \mathit{AllDiff}(i) \to C_i(i)$ is a generalization of $I^d$. In our algorithm, we 
add the set $\{\forall i.\ C_i(i)\}_{i=1}^{n}$ of new candidate lemmas to $\Psi$. Note that we omitted 
the formula $\mathit{AllDiff}$ for our assumption on the different values of index variables. 
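The generalization step amounts to replacing the index constants of each clause with universally quantified variables. The following sketch works on clauses written as plain strings; the textual clause syntax, the constant naming scheme `c1, c2, ...` and the function names are assumptions of this example, not the representation used by the tool.

```python
import re


def generalize_clause(clause):
    """Replace index constants c1, c2, ... with quantified variables i1, i2, ...

    Variables are numbered by order of first occurrence, so clauses that differ
    only by a renaming of the constants yield the same candidate lemma.
    """
    names = {}

    def rename(match):
        constant = match.group(0)
        if constant not in names:
            names[constant] = f"i{len(names) + 1}"
        return names[constant]

    body = re.sub(r"\bc\d+\b", rename, clause)
    if not names:
        return body
    return f"forall {', '.join(names.values())}. {body}"


def generalize(invariant_clauses):
    """Candidate lemmas obtained from the clauses of an inductive invariant I^d."""
    return {generalize_clause(clause) for clause in invariant_clauses}


lemmas = generalize([
    "!(s[c1] = critical & s[c3] = critical)",
    "t[c2] = true | s[c2] = idle",
    "t[c1] = true | s[c1] = idle",
])
```

As in the text, the `AllDiff` premise is left implicit: the quantified variables of a lemma are assumed to denote distinct indices.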


Fixing Unsound Lemmas. Unfortunately, we know a priori that a lemma 
holds only for the instance from which it was generalized. In general, its universal 
generalization obtained as outlined above might not hold in the system. 

Suppose that the formula $\psi_1$ is a candidate lemma, obtained by generalization 
after the successful verification of an instance of size $d$. Suppose that later, 
a counterexample for $\psi_1$ is found by exploring a different instance $C^{d'}$ (with 
$d' > d$). This means that the lemma $\psi_1$ does not hold universally, but only for 
some finite instances of the system (including $C^d$). In this 
case, we simply remove $\psi_1$ from the set of candidate lemmas $\Psi$, thus effectively 
weakening our working property (from $\Phi \wedge \Psi$ to $\Phi \wedge (\Psi \setminus \{\psi_1\})$). While this may 
cause a particular (abstract) counterexample to be encountered more than once 
during the main loop of the algorithm, since the finite instances are explored 
monotonically and their size $d$ is increased after every successful verification 
of a bounded instance, the overall procedure still makes progress by exploring 
increasingly-large instances of the system. The hope is that eventually the 
algorithm will discover enough good lemmas that block the abstract counterexample. 
This notion of (weak) progress is justified by the following: 


Proposition 4 Let $\pi$ be an abstract counterexample, $\Psi$ be the current set of 
universally quantified lemmas, and $d$ be the size of the bounded instance to 
explore. During every execution of the algorithm, the same triple $(\pi, \Psi, d)$ never 
occurs twice. 


7 Related Work 


Parametric verification is a challenging problem, and there is a large body of 
work in the literature devoted to this problem. Here, we (necessarily) focus on 
the approaches that are most related to ours. 

Several methods are based on quantifier elimination using decidable frag- 
ments of first order logic, with notable examples in [7, 10,22]. These methods 
guarantee a high degree of automation, but typically impose strong syntactic 
requirements in the input problem, and may suffer from scalability issues. A 
second popular approach is based on abstraction and abstraction refinement. 
Within this family of abstractions, earlier versions of the Parameter Abstraction 
[3,15] have been used successfully also for industrial protocols [24]. The 
main drawback is that the degree of automation is limited, and substantial 
expertise is required to obtain the desired results. The first steps of our abstraction 
algorithm are inspired by the ones in [19] and [15]. The key difference from [19] 
is that in that work the abstract transition system $\tilde{C}$ is given by an eager 
propositional abstraction, with the axioms of the background theories recovered by 
the usage of some schemata. Here we retain the theory of arrays in the abstract 
space $\tilde{C}$. Moreover, differently from both [15] and [19], our procedure includes 
an automatic refinement of the abstraction in a counterexample-driven manner. 


Ivy [20,22] implements both semi-automatic invariant checking with decid- 
able logics (namely, Effectively Propositional Logic - EPR) and compositional 
abstraction with eager axioms [19]. MyPyvy [13,14] is a model checker inspired 
by the language of Ivy. It implements a version of IC3 capable of dealing with uni- 
versal formulas [13]; the algorithm is completely automatic, but it is still based 
on quantifier elimination via reduction to decidable logics. In a more recent work, 
MyPyvy has gained the capability of inferring invariants with quantifier alter- 
nations, using a procedure that combines separators and first-order logic [14]. At 
the moment, our framework is capable of handling only universally quantified 
invariants. On the other hand, our approach is not limited to EPR, but it can 
in principle handle formulae with arbitrary SMT theories. 


Exploring small instances of a parameterized system for candidate lemmas 
is a popular approach for parametric verification. In [8], this idea is used to 
over-approximate backward reachable states inside an algorithm which combines 
backward search and quantifier elimination. In [16], a finite-instance exploration 
is used together with a theorem prover to check the validity of candidate lemmas. 
In [17], candidate invariants are obtained from the set of reachable states of 
small instances. Similarly to our approach, these lemmas are used to strengthen 
an earlier version of the parameter abstraction. However, human intervention is 
still needed for the refinement. 


A similar approach is presented in [23], where lemmas are obtained from a 
generalization of the proof of the property in a small instance of the protocol. 
The main difference with our technique, besides the methods used to extract 
such invariants, is the following: in [23], the authors show that to prove that a 
property (conjoined with lemmas) is inductive for all $N$, it is enough to prove 
that it is inductive for a particular $N_0$, which is computable from the number 
of variables in the description of the system. This result is obtained from the 
imposed syntactic structure of the system. On the other hand, we impose less 
structure, and we rely on proving the property in an abstract version (and not 
a concrete instance) of the system. Moreover, our approach is integrated in an 
abstraction/refinement loop, which is missing from [23]. 


Another SMT-based approach for parametric verification is in [12]. The 
method is based on a reduction of invariant checking to the satisfiability of 
non-linear Constrained Horn Clauses (CHCs). Besides differing substantially in 
the overall approach, the method is more restrictive in the input language, and 
handles invariants only with a specific syntactic structure. 
The use of prophecy variables for inferring universally quantified invariants 
has been explored also in non-parametric contexts, such as [18]. The main dif- 
ference with our work is that [18] focuses on finding quantified invariants for 
quantifier-free transition systems with arrays, rather than array-based systems 
with quantifiers. The overall abstraction-refinement approach is also substan- 
tially different. 


8 Experimental Evaluation 


We have implemented our algorithm in a tool called LAMBDA (for Learning 
Abstractions froM BoundeD Analysis). LAMBDA is written in Python, and 
uses the SMT-based IC3 with implicit predicate abstraction of [4] as underly- 
ing quantifier-free verification engine. LAMBDA accepts as input array-based 
systems specified either in the language of MCMT [11] or in VMT format (a 
light-weight extension of SMT-LIB to model transition systems [25]). In case 
of successful termination, LAMBDA generates either a counterexample trace (for 
violated properties) in a concrete instance of the parametric system, or a quanti- 
fied inductive invariant that proves the property for any instance of the system. 
In the latter case, LAMBDA can also generate proof obligations that can be in- 
dependently checked with an SMT solver supporting quantifiers, such as Z3 [21] 
or CVC4 [2]. More specifically, the quantified inductive invariant can be 
generated by LAMBDA by simply universally quantifying all the (index) variables in 
the inductive invariant generated for $\tilde{C}$, and conjoining it with the lemmas $\Psi$ 
discovered during the main loop iterations. Computing such an invariant is 
immediate after the termination of the algorithm, and does not require additional 
reasoning. 

In order to evaluate the effectiveness of our method, we have compared 
LAMBDA with two state-of-the-art tools for the verification of array-based sys- 
tems, namely CUBICLE [7] and MCMT. We could not include MyPyvy in the 
comparison, due to the many differences in input languages and modeling for- 
malisms, which make an automatic translation of the benchmarks very difficult. 
We would also have liked to compare with the technique of [12], however the 
prototype tool mentioned in the paper doesn’t seem to be available. 

For our evaluation, we have collected a total of 116 benchmarks, divided into 
three different groups: 

Protocols consists of 42 instances taken from the MCMT or the CUBICLE 
distributions, and used in previous works on verification of array-based systems. We 
have used all the instances which were available in both input formats, and we 
have split benchmarks containing multiple properties into different files. 

DynArch consists of 57 instances of verification problems of dynamic architec- 
tures, taken from [6]. These benchmarks make use of arithmetic constraints on 


6 In our implementation, we use the theory of integers as an index theory. At first, this 
may seem odd, since we should consider all finite subsets of the integers. However, 
this is not a problem, since the satisfiability of a quantifier-free UFLIA formula is 
equivalent to its satisfiability in a finite index model. 
Table 1. Summary of experimental results. 


                                    LAMBDA           MCMT             CUBICLE 
Benchmark family | # of instances | Solved  Unique | Solved  Unique | Solved  Unique 
Protocols        | 42             | 34      3      | 24      0      | 30      1 
DynArch          | 57             | 48      5      | 48      5      | -       - 
Trains           | 17             | 17      17     | -       -      | -       - 


index terms, which are not supported by CUBICLE. Therefore, we could only 
compare LAMBDA with MCMT on them. 

Trains consists of 17 instances derived from (a simplified version of) verification 
problems on railway interlocking logics [1]. These benchmarks make use of 
several features that are not fully supported by CUBICLE and MCMT (such 
as non-functional updates in the transition relation, transition rules with more 
than one universally-quantified variable, real-valued variables). None of these 
restrictions applies to LAMBDA, which in general accepts models with significantly 
fewer syntactic constraints than CUBICLE and MCMT. Since these instances 
are inspired by relevant real-world verification problems, we believe that it is 
interesting to include them in the evaluation even though we could only run 
LAMBDA on them. 

Our implementation, all the benchmarks, and the scripts for reproducing the 
results are available at http://es.fbk.eu/people/griggio/papers/cade21-lambda.tar.gz. 
We have run our experiments on a cluster of machines with a 2.90GHz 
Intel Xeon Gold 6226R CPU running Ubuntu Linux 20.04.1, using a time limit of 
1 hour and a memory limit of 4GB for each instance. We have used the default 
settings for MCMT, whereas for CUBICLE we have also enabled the BRAB 
algorithm.⁷ A summary of the results of our evaluation is presented in Table 1. 
More details are provided in our extended version [5]. 

Overall, LAMBDA is very competitive with the state of the art, and in fact 
it solves the largest number of instances (even when disregarding the Trains 
group, which cannot be handled by the other tools).When considering the Pro- 
tocols group, CUBICLE is often significantly faster than LAMBDA, especially on 
easier problems, thanks to its explicit-state exploration component (part of the 
BRAB algorithm). However, the symbolic techniques used by LAMBDA allow it to 
generally scale better to larger, more challenging problems: in the end, LAMBDA 
solves 4 more instances than CUBICLE, and 10 more than MCMT. The situa- 
tion is different for the DynArch group, in which LAMBDA and MCMT solve the 
same number of instances. However, it is interesting to observe that each tool
can solve 5 instances that the other cannot solve; more generally, it seems
that the two approaches have somewhat complementary strengths. Moreover, as
already stated above, the fact that LAMBDA imposes significantly fewer syntactic
restrictions than the other two tools considered allowed it to handle all the
instances of the Trains group, which cannot be easily modeled in the languages of
MCMT or CUBICLE. 


7 The results reported were obtained using -brab 2; we have however experimented 
also with other (small) values for -brab, without noticing any significant difference. 




Finally, we wish to remark that we have generated SMT proof obligations for 
checking the correctness of all the (universally quantified) inductive invariants 
produced by LAMBDA, and checked them with both CVC4 and Z3. None of the 
solvers reported any error, and overall the combination of the two solvers was able 
to successfully verify all the proof obligations for 65 of the 67 instances reported 
as safe. We believe that the fact that we can easily produce proof obligations 
that can be independently checked is another strength of our approach. This is 
in contrast to the approach of CUBICLE, where generating proof obligations is 
nontrivial [9]. 


9 Conclusions 


In this paper we tackled the problem of universal invariant checking for paramet- 
ric systems. We proposed a fully-automated abstraction-refinement approach, 
based on quantifier-free reasoning. The abstract model, which stutter-simulates
the concrete model, is a quantifier-free symbolic transition system refined by (the 
instantiation of) candidate universal lemmas. These are obtained by analyzing 
the proofs of validity of the property in a finite instance of the parametric system. 
We experimentally evaluated an implementation on standard benchmarks from 
the literature. The results show the effectiveness of the method, also in compar- 
ison with state-of-the-art tools (CUBICLE, MCMT). We are able to prove, in a 
fully automated manner and without manual intervention, several benchmarks 
that are considered challenging. In the future, we plan to work on generalization, 
to improve the ability of inferring the right lemmas from a small instance, and 
to find more effective ways to filter out bad candidates. On the theoretical side, 
we will investigate the relation between the termination of the algorithm and 
decidable classes of parametric systems (e.g. those that enjoy a cut-off prop- 
erty). Finally, we will work on the verification of temporally extended properties 
which are also preserved by stuttering simulations (such as fragments of Linear 
Temporal Logic). 

Abstract. We make two contributions to the study of polite combination
in satisfiability modulo theories. The first is a separation between
politeness and strong politeness, by presenting a polite theory that is not 
strongly polite. This result shows that proving strong politeness (which 
is often harder than proving politeness) is sometimes needed in order to 
use polite combination. The second contribution is an optimization to 
the polite combination method, obtained by borrowing from the Nelson- 
Oppen method. The Nelson-Oppen method is based on guessing arrange- 
ments over shared variables. In contrast, polite combination requires an 
arrangement over all variables of the shared sorts. We show that when 
using polite combination, if the other theory is stably infinite with re- 
spect to a shared sort, only the shared variables of that sort need be 
considered in arrangements, as in the Nelson-Oppen method. The time 
required to reason about arrangements is exponential in the worst case, 
so reducing the number of variables considered has the potential to im- 
prove performance significantly. We show preliminary evidence for this 
by demonstrating a speed-up on a smart contract verification benchmark. 


1 Introduction 


Solvers for satisfiability modulo theories (SMT) [5] are used in a wide variety 
of applications. Many of these applications require determining the satisfiability 
of formulas with respect to a combination of background theories. In order to 
make reasoning about combinations of theories modular and easily extensible, 
a combination framework is essential. Combination frameworks provide mecha- 
nisms for automatically deriving a decision procedure for the combined theories 
by using the decision procedures for the individual theories as black boxes. To 
integrate a new theory into such a framework, it then suffices to focus on the de- 
coupled decision procedure for the new theory alone, together with its interface 
to the generic combination framework. 

In 1979, Nelson and Oppen [16] proposed a general framework for combining 
theories with disjoint signatures. In this framework, a quantifier-free formula in 
the combined theory is purified to a conjunction of formulas, one for each theory.
Each pure formula is then sent to a dedicated theory solver, along with a guessed 
arrangement (a set of equalities and disequalities that capture an equivalence re- 
lation) of the variables shared among the pure formulas. For completeness [15], 
this method requires all component theories to be stably infinite. While many 
important theories are stably infinite, some are not, including the widely-used 
theory of fixed-length bit-vectors. To address this issue, the polite combination 
method was introduced by Ranise et al. [17], and later refined by Jovanovic 
and Barrett [12]. In polite combination, one theory must be polite, a stronger 
requirement than stable-infiniteness, but the requirement on the other theory is 
relaxed: specifically, it need not be stably infinite. The price for this generality is 
that unlike the Nelson-Oppen method, polite combination requires guessing ar- 
rangements over all variables of certain sorts, not just the shared ones. At a high 
level, polite theories have two properties: smoothness and finite witnessability 
(see Section 2). The polite combination theorem in [17] contained an error, which 
was identified in [12]. A fix was also proposed in [12], which relies on stronger 
requirements for finite witnessability. Following Casal and Rasga [8], we call this 
strengthened version strong finite witnessability. A theory that is both smooth 
and strongly finitely witnessable is called strongly polite. 


This paper makes two contributions. First, we give an affirmative answer to 
the question of whether politeness and strong politeness are different notions, by 
giving an example of a theory that is polite but not strongly polite. The given 
theory is over an empty signature and has two sorts, and was originally studied 
in [8] in the context of shiny theories. Here we state and prove the separation 
of politeness and strong politeness, without using shiny theories. Proving that a 
theory is strongly polite is harder than proving that it is just polite. This result 
shows that the additional effort is sometimes needed in order to be able to use the 
combination theorem from [12]. We show that for empty signatures, at least two 
sorts are needed to present a polite theory that is not strongly polite. However, 
for the empty signature with only one sort, there is a finitely witnessable theory 
that is not strongly finite witnessable. Such a theory cannot be smooth. 


Second, we explore different polite combination scenarios, where additional 
information is known about the theories being combined. In particular, we
improve the polite combination method for the case where one theory is strongly
polite w.r.t. a set $S$ of sorts and the other is stably infinite w.r.t. a subset
$S' \subseteq S$ of the sorts. For such cases, we show that it is possible to
perform Nelson-Oppen combination for $S'$ and polite combination for
$S \setminus S'$. This means that for the sorts in $S'$, only shared variables
need to be considered for the guessed arrangement, which can considerably
reduce its size. We also show that the set of shared
variables can be reduced for a couple of other variations of conditions on the the- 
ories. Finally, we present a preliminary case study using a challenge benchmark 
from a smart contract verification application. We show that the reduction of 
shared variables is evident and significantly improves the solving time. Verifica- 
tion of smart contracts using SMT (and the analyzed benchmark in particular) 
is the main motivation behind the second contribution of this paper. 

Related Work: Polite combination is part of a more general effort to replace the
stable infiniteness symmetric condition in the Nelson-Oppen approach with a 
weaker condition. Other examples of this effort include the notions of shiny [21], 
parametric [13], and gentle [11] theories. Gentle, shiny, and polite theories can 
be combined à la Nelson-Oppen with any arbitrary theory. Shiny theories were 
introduced by Tinelli and Zarba [21] as a class of mono-sorted theories. Based 
on the same principles as shininess, politeness is particularly well-suited to deal 
with theories expressed in many-sorted logic. Polite theories were introduced by 
Ranise et al. [17] to provide a more effective combination approach compared 
to parametric and shiny theories, the former requiring solvers to reason about 
cardinalities and the latter relying on expensive computations of minimal car- 
dinalities of models. Shiny theories were extended to many-sorted signatures 
in [17], where there is a sufficient condition for their equivalence with polite 
theories. For the mono-sorted case, a sufficient condition for the equivalence of 
shiny theories and strongly polite theories was given by Casal and Rasga [7]. 
In later work [8], the same authors proposed a generalization of shiny theories 
to many-sorted signatures different from the one in [17], and proved that it is 
equivalent to strongly polite theories with a decidable quantifier-free fragment. 
The strong politeness of the theory of algebraic datatypes [4] was proven in [18]. 
That paper also introduced additive witnesses, that provided a sufficient con- 
dition for a polite theory to be also strongly polite. In this paper we present a 
theory that is polite but not strongly polite. In accordance with [18], the witness 
that we provide for this theory is not additive. 

The paper is organized as follows. Section 2 provides the necessary notions 
from first-order logic and polite theories. Section 3 discusses the difference be- 
tween politeness and strong politeness and shows they are not equivalent. Sec- 
tion 4 gives the improvements for the combination process under certain condi- 
tions, and Section 5 demonstrates the effectiveness of these improvements for a 
challenge benchmark.⁴


2 Preliminaries 


2.1 Signatures and Structures 


We briefly review the usual definitions of many-sorted first-order logic with
equality (see [10,19] for more details). A signature $\Sigma$ consists of a set
$S_\Sigma$ (of sorts), a set $F_\Sigma$ of function symbols, and a set $P_\Sigma$
of predicate symbols. We assume $S_\Sigma$, $F_\Sigma$ and $P_\Sigma$ are
countable. Function symbols have arities of the form
$\sigma_1 \times \ldots \times \sigma_n \to \sigma$, and predicate symbols have
arities of the form $\sigma_1 \times \ldots \times \sigma_n$, with
$\sigma_1, \ldots, \sigma_n, \sigma \in S_\Sigma$. For each sort
$\sigma \in S_\Sigma$, $P_\Sigma$ includes an equality symbol $=_\sigma$ of arity
$\sigma \times \sigma$. We denote it by $=$ when $\sigma$ is clear from context.
When the symbols $=_\sigma$ are the only symbols in $\Sigma$, we say that
$\Sigma$ is empty. If two signatures share no symbols except $=_\sigma$, we call
them disjoint. We assume an underlying countably infinite set of variables for
each sort. Terms, formulas, and literals are defined in the usual way. For a
$\Sigma$-formula $\varphi$ and a sort $\sigma$, we denote the set of free
variables in $\varphi$ of sort $\sigma$ by $\mathit{vars}_\sigma(\varphi)$. This
notation naturally extends to $\mathit{vars}_S(\varphi)$ when $S$ is a set of
sorts. $\mathit{vars}(\varphi)$ is the set of all free variables in $\varphi$. We
denote by $QF(\Sigma)$ the set of quantifier-free $\Sigma$-formulas.

4 Due to space constraints, some proofs are omitted. They can be found in an
extended version at https://arxiv.org/abs/2104.11738.

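A $\Sigma$-structure is a many-sorted structure that provides semantics for the
symbols in $\Sigma$ (but not for variables). It consists of a domain
$\sigma^{\mathcal{A}}$ for each sort $\sigma \in S_\Sigma$, an interpretation
$f^{\mathcal{A}}$ for every $f \in F_\Sigma$, as well as an interpretation
$P^{\mathcal{A}}$ for every $P \in P_\Sigma$. We further require that $=_\sigma$
be interpreted as the identity relation over $\sigma^{\mathcal{A}}$ for every
$\sigma \in S_\Sigma$. A $\Sigma$-interpretation $\mathcal{A}$ is an extension of
a $\Sigma$-structure with interpretations for some set of variables. For any
$\Sigma$-term $\alpha$, $\alpha^{\mathcal{A}}$ denotes the interpretation of
$\alpha$ in $\mathcal{A}$. When $\alpha$ is a set of $\Sigma$-terms,
$\alpha^{\mathcal{A}} = \{x^{\mathcal{A}} \mid x \in \alpha\}$. Satisfaction is
defined as usual. $\mathcal{A} \models \varphi$ denotes that $\mathcal{A}$
satisfies $\varphi$.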


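A $\Sigma$-theory $\mathcal{T}$ is a class of all $\Sigma$-structures that satisfy
some set $Ax$ of $\Sigma$-sentences. For each such set $Ax$, we say that
$\mathcal{T}$ is axiomatized by $Ax$. A $\Sigma$-interpretation whose
variable-free part is in $\mathcal{T}$ is called a $\mathcal{T}$-interpretation.
A $\Sigma$-formula $\varphi$ is $\mathcal{T}$-satisfiable if
$\mathcal{A} \models \varphi$ for some $\mathcal{T}$-interpretation $\mathcal{A}$.
A set $A$ of $\Sigma$-formulas is $\mathcal{T}$-satisfiable if there is a
$\mathcal{T}$-interpretation $\mathcal{A}$ such that $\mathcal{A} \models \varphi$
for every $\varphi \in A$. Two formulas $\varphi$ and $\psi$ are
$\mathcal{T}$-equivalent if they are satisfied by the same
$\mathcal{T}$-interpretations.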


Note that for any class $C$ of $\Sigma$-structures there is a theory
$\mathcal{T}_C$ that corresponds to it, with the same satisfiable formulas: the
$\Sigma$-theory axiomatized by the set $Ax$ of $\Sigma$-sentences that are
satisfied in every structure of $C$. In the examples that follow, we define
theories $\mathcal{T}_C$ implicitly by specifying only the class $C$, as done in
the SMT-LIB 2 standard [2]. This can be done without loss of generality.


Example 1. Let $\Sigma_{\mathsf{List}}$ be a signature of finite lists containing
the sorts $\mathsf{elem}_1$, $\mathsf{elem}_2$ and list, as well as the function
symbols cons of arity
$\mathsf{elem}_1 \times \mathsf{elem}_2 \times \mathsf{list} \to \mathsf{list}$,
$\mathsf{car}_1$ of arity $\mathsf{list} \to \mathsf{elem}_1$, $\mathsf{car}_2$ of
arity $\mathsf{list} \to \mathsf{elem}_2$, cdr of arity
$\mathsf{list} \to \mathsf{list}$, and nil of arity list. The
$\Sigma_{\mathsf{List}}$-theory $\mathcal{T}_{\mathsf{List}}$ corresponds to an
SMT-LIB 2 theory of algebraic datatypes [2,4], where $\mathsf{elem}_1$ and
$\mathsf{elem}_2$ are interpreted as some sets (of "elements"), and list is
interpreted as finite lists of pairs of elements, one from $\mathsf{elem}_1$ and
the other from $\mathsf{elem}_2$. cons is a list constructor that takes two
elements and a list, and inserts the two elements at the head of the list. The
pair $(\mathsf{car}_1(l), \mathsf{car}_2(l))$ is the first entry in $l$, and
$\mathsf{cdr}(l)$ is the list obtained from $l$ by removing its first entry. nil
is the empty list.


Example 2. The signature $\Sigma_{\mathsf{Int}}$ includes a single sort int, all
numerals $0, 1, \ldots$, the function symbols $+$, $-$ and $\cdot$ of arity
$\mathsf{int} \times \mathsf{int} \to \mathsf{int}$ and the predicate symbols $<$
and $\leq$ of arity $\mathsf{int} \times \mathsf{int}$. The
$\Sigma_{\mathsf{Int}}$-theory $\mathcal{T}_{\mathsf{Int}}$ corresponds to integer
arithmetic in SMT-LIB 2, and the interpretation of the symbols is the same as
in the standard structure of the integers. The signature $\Sigma_{\mathsf{BV4}}$
includes a single sort BV4 and various function and predicate symbols for
reasoning about bit-vectors of length 4 (such as & for bit-wise and, constants of
the form 0110, etc.). The $\Sigma_{\mathsf{BV4}}$-theory
$\mathcal{T}_{\mathsf{BV4}}$ corresponds to SMT-LIB 2 bit-vectors of size 4, with
the expected semantics of constants and operators.
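
Let $\Sigma_1$, $\Sigma_2$ be signatures, $\mathcal{T}_1$ a $\Sigma_1$-theory, and
$\mathcal{T}_2$ a $\Sigma_2$-theory. The combination of $\mathcal{T}_1$ and
$\mathcal{T}_2$, denoted $\mathcal{T}_1 \oplus \mathcal{T}_2$, consists of all
$\Sigma_1 \cup \Sigma_2$-structures $\mathcal{A}$ such that
$\mathcal{A}^{\Sigma_1}$ is in $\mathcal{T}_1$ and $\mathcal{A}^{\Sigma_2}$ is in
$\mathcal{T}_2$, where $\mathcal{A}^{\Sigma_i}$ is the reduct of $\mathcal{A}$ to
$\Sigma_i$ for $i \in \{1, 2\}$.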


Example 3. Let $\mathcal{T}_{\mathsf{IntBV4}}$ be
$\mathcal{T}_{\mathsf{Int}} \oplus \mathcal{T}_{\mathsf{BV4}}$. It is the combined
theory of integers and bit-vectors. It has all the sorts and operators from both
theories. If we rename the sorts $\mathsf{elem}_1$ and $\mathsf{elem}_2$ of
$\Sigma_{\mathsf{List}}$ to int and BV4, respectively, we can obtain a theory
$\mathcal{T}_{\mathsf{ListIntBV4}}$ defined as
$\mathcal{T}_{\mathsf{IntBV4}} \oplus \mathcal{T}_{\mathsf{List}}$. This is the
theory of lists of pairs, where each pair consists of an integer and a
bit-vector of size 4.


The following definitions and theorems will be useful in the sequel. 


Theorem 1 (Theorem 9 of [19]). Let $\Sigma$ be a signature, and $A$ a set of
$\Sigma$-formulas that is satisfiable. Then there exists an interpretation
$\mathcal{A}$ that satisfies $A$, in which $\sigma^{\mathcal{A}}$ is countable
whenever it is infinite.⁵


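Definition 1 (Arrangement). Let $V$ be a finite set of variables whose sorts
are in $S$ and let $\{V_\sigma \mid \sigma \in S\}$ be a partition of $V$ such
that $V_\sigma$ is the set of variables of sort $\sigma$ in $V$. A formula
$\delta$ is an arrangement of $V$ if

$$\delta = \bigwedge_{\sigma \in S} \Big( \bigwedge_{(x,y) \in E_\sigma} (x = y) \;\wedge\; \bigwedge_{x,y \in V_\sigma,\ (x,y) \notin E_\sigma} (x \neq y) \Big),$$

where $E_\sigma$ is some equivalence relation over $V_\sigma$ for each $\sigma \in S$.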
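
To make Definition 1 concrete, the following Python sketch enumerates arrangements by choosing one partition of the variables per sort. The function names and the string rendering of literals are illustrative assumptions, not notation from the paper; equalities are emitted only for unordered pairs of distinct variables, which is logically equivalent to ranging over all of $E_\sigma$.

```python
from itertools import combinations


def partitions(variables):
    """Yield every partition of `variables` as a list of blocks (lists)."""
    variables = list(variables)
    if not variables:
        yield []
        return
    first, rest = variables[0], variables[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + smaller


def arrangement(blocks):
    """Render the arrangement induced by one partition as a list of literals."""
    literals = []
    for block in blocks:
        # Variables in the same block are equal.
        literals += [f"{x} = {y}" for x, y in combinations(block, 2)]
    for b1, b2 in combinations(blocks, 2):
        # Variables in different blocks are distinct.
        literals += [f"{x} != {y}" for x in b1 for y in b2]
    return literals


def arrangements(vars_by_sort):
    """Yield all arrangements of a sorted variable set (one partition per sort)."""
    sorts = list(vars_by_sort)

    def go(i):
        if i == len(sorts):
            yield []
            return
        for blocks in partitions(vars_by_sort[sorts[i]]):
            for tail in go(i + 1):
                yield arrangement(blocks) + tail

    yield from go(0)
```

For a single sort with n variables, the number of arrangements is the n-th Bell number (5 for three variables, 15 for four), which is why the size of the variable set matters for the combination methods discussed later.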
The following theorem from [12] is a variant of a theorem from [20]. 


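Theorem 2 (Theorem 2.5 of [12]). For $i = 1, 2$, let $\Sigma_i$ be disjoint
signatures, $S_i = S_{\Sigma_i}$ with $S = S_1 \cap S_2$, $\mathcal{T}_i$ be a
$\Sigma_i$-theory, $\Gamma_i$ be a set of $\Sigma_i$-literals, and
$V = \mathit{vars}(\Gamma_1) \cap \mathit{vars}(\Gamma_2)$. If there exist a
$\mathcal{T}_1$-interpretation $\mathcal{A}$, a $\mathcal{T}_2$-interpretation
$\mathcal{B}$, and an arrangement $\delta_V$ of $V$ such that:
1. $\mathcal{A} \models \Gamma_1 \cup \delta_V$;
2. $\mathcal{B} \models \Gamma_2 \cup \delta_V$; and
3. $|\sigma^{\mathcal{A}}| = |\sigma^{\mathcal{B}}|$ for every $\sigma \in S$,
then $\Gamma_1 \cup \Gamma_2$ is $\mathcal{T}_1 \oplus \mathcal{T}_2$-satisfiable.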


2.2 Polite Theories 


We now give the background definitions necessary for both Nelson-Oppen and 
polite combination. In what follows, $\Sigma$ is an arbitrary (many-sorted) signature,
$S \subseteq S_\Sigma$, and $\mathcal{T}$ is a $\Sigma$-theory. We start with stable infiniteness and smoothness.


Definition 2 (Stably Infinite). $\mathcal{T}$ is stably infinite with respect to
$S$ if every quantifier-free $\Sigma$-formula that is $\mathcal{T}$-satisfiable is
also satisfiable in a $\mathcal{T}$-interpretation $\mathcal{A}$ in which
$\sigma^{\mathcal{A}}$ is infinite for every $\sigma \in S$.


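Definition 3 (Smooth). $\mathcal{T}$ is smooth w.r.t. $S$ if for every
quantifier-free formula $\varphi$, $\mathcal{T}$-interpretation $\mathcal{A}$ that
satisfies $\varphi$, and function $\kappa$ from $S$ to the class of cardinals such
that $\kappa(\sigma) \geq |\sigma^{\mathcal{A}}|$ for every $\sigma \in S$, there
exists a $\mathcal{T}$-interpretation $\mathcal{A}'$ that satisfies $\varphi$ with
$|\sigma^{\mathcal{A}'}| = \kappa(\sigma)$ for every $\sigma \in S$.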


5 In [19] this was proven more generally, for ordered sorted logics. 

We identify singleton sets with their single elements when there is no ambiguity
(e.g., when saying that a theory is smooth w.r.t. a sort o). 

We next define politeness and related concepts, following the presentation 
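in [18]. Let $\varphi$ be a quantifier-free $\Sigma$-formula. A
$\Sigma$-interpretation $\mathcal{A}$ finitely witnesses $\varphi$ for
$\mathcal{T}$ w.r.t. $S$ (or, is a finite witness of $\varphi$ for $\mathcal{T}$
w.r.t. $S$), if $\mathcal{A} \models \varphi$ and
$\sigma^{\mathcal{A}} = \mathit{vars}_\sigma(\varphi)^{\mathcal{A}}$ for every
$\sigma \in S$. We say that $\varphi$ is finitely witnessed for $\mathcal{T}$
w.r.t. $S$ if it is either $\mathcal{T}$-unsatisfiable or has a finite witness
for $\mathcal{T}$ w.r.t. $S$. We say that $\varphi$ is strongly finitely witnessed
for $\mathcal{T}$ w.r.t. $S$ if $\varphi \wedge \delta_V$ is finitely witnessed
for $\mathcal{T}$ w.r.t. $S$ for every arrangement $\delta_V$ of $V$, where $V$ is
any set of variables whose sorts are in $S$. A function
$\mathit{wit} : QF(\Sigma) \to QF(\Sigma)$ is a (strong) witness for $\mathcal{T}$
w.r.t. $S$ if for every $\varphi \in QF(\Sigma)$ we have that:
1. $\varphi$ and $\exists \vec{w}.\, \mathit{wit}(\varphi)$ are
$\mathcal{T}$-equivalent for
$\vec{w} = \mathit{vars}(\mathit{wit}(\varphi)) \setminus \mathit{vars}(\varphi)$; and
2. $\mathit{wit}(\varphi)$ is (strongly) finitely witnessed for $\mathcal{T}$
w.r.t. $S$. $\mathcal{T}$ is (strongly) finitely witnessable w.r.t. $S$ if
there exists a computable (strong) witness for $\mathcal{T}$ w.r.t. $S$.
$\mathcal{T}$ is (strongly) polite w.r.t. $S$ if it is smooth and (strongly)
finitely witnessable w.r.t. $S$.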


3 Politeness and Strong Politeness 


In this section, we study the difference between politeness and strong politeness. 
Since the introduction of strong politeness in [12], it has been unclear whether 
it is strictly stronger than politeness, that is, whether there exists a theory 
that is polite but not strongly polite. We present an example of such a theory, 
answering the open question affirmatively. This result is followed by further 
analysis of notions related to politeness. This section is organized as follows. 
In Section 3.1 we reformulate an example given in [12], showing that there are 
witnesses that are not strong witnesses. We then present a polite theory that 
is not strongly polite in Section 3.2. The theory is over a signature with two 
sorts that is otherwise empty. We show in Section 3.3 that politeness and strong 
politeness are equivalent for empty signatures with a single sort. Finally, we show 
in Section 3.4 that this equivalence does not hold for finite witnessability alone. 


3.1 Witnesses vs. Strong Witnesses 


In [12], an example was given for a witness that is not strong. We reformulate 
this example in terms of the notions that are defined in the current paper, that is, 
witnessed formulas are not the same as strongly witnessed formulas (Example 4), 
and witnesses are not the same as strong witnesses (Example 5). 


Example 4. Let $\Sigma_0$ be a signature with a single sort $\sigma$ and no
function or predicate symbols, and let $\mathcal{T}_0$ be a $\Sigma_0$-theory
consisting of all $\Sigma_0$-structures with at least two elements. Let $\varphi$
be the formula $x = x \wedge w = w$. This formula is finitely witnessed for
$\mathcal{T}_0$ w.r.t. $\sigma$, but not strongly. Indeed, for
$\delta_V = (x = w)$, $\varphi \wedge \delta_V$ is not finitely witnessed for
$\mathcal{T}_0$ w.r.t. $\sigma$: a finite witness would be required to have
only a single element and would therefore not be a $\mathcal{T}_0$-interpretation.
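
The argument of Example 4 can be checked by brute force. The sketch below is an illustration under stated assumptions: interpretations are modeled as a domain {0, ..., n-1} plus a variable assignment, formulas are restricted to conjunctions of equalities, and the helper name `finite_witnesses` is hypothetical. Bounding the domain size is harmless here because a finite witness cannot have more elements than there are variables.

```python
from itertools import product


def finite_witnesses(equalities, variables, max_size=4):
    """Search for finite witnesses w.r.t. the single sort of T_0.

    A candidate is accepted if (i) it is a T_0-interpretation, i.e. its
    domain has at least two elements, (ii) it satisfies every equality in
    `equalities`, and (iii) its domain equals the set of values taken by
    the variables.
    """
    found = []
    for n in range(1, max_size + 1):
        for values in product(range(n), repeat=len(variables)):
            env = dict(zip(variables, values))
            in_theory = n >= 2
            satisfies = all(env[a] == env[b] for a, b in equalities)
            covers_domain = set(values) == set(range(n))
            if in_theory and satisfies and covers_domain:
                found.append((n, env))
    return found


# phi is x = x /\ w = w; the arrangement delta_V adds x = w.
phi = [("x", "x"), ("w", "w")]
phi_and_delta = phi + [("x", "w")]

witnesses_phi = finite_witnesses(phi, ["x", "w"])
witnesses_phi_and_delta = finite_witnesses(phi_and_delta, ["x", "w"])
```

Under these assumptions the search finds witnesses for the formula alone (two-element domains with x and w mapped to different elements) and none once x = w is imposed.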


The next example shows that witnesses and strong witnesses are not equivalent. 
Example 5. Take $\Sigma_0$, $\sigma$, and $\mathcal{T}_0$ as in Example 4, and
define the function $\mathit{wit}$ by
$\mathit{wit}(\varphi) = (\varphi \wedge w_1 = w_1 \wedge w_2 = w_2)$ for fresh
$w_1, w_2$. This function is a witness for $\mathcal{T}_0$ w.r.t. $\sigma$.
However, it is not a strong witness for $\mathcal{T}_0$ w.r.t. $\sigma$.


Although the theory $\mathcal{T}_0$ in the above examples does serve to
distinguish formulas and witnesses that are and are not strong, it cannot be
used to do the same for theories themselves. This is because $\mathcal{T}_0$ is,
in fact, strongly polite, via a different witness function.


Example 6. The function $\mathit{wit}'(\varphi) = (\varphi \wedge w_1 \neq w_2)$,
for some $w_1, w_2 \notin \mathit{vars}_\sigma(\varphi)$, is a strong witness for
$\mathcal{T}_0$ w.r.t. $\sigma$, as proved in [12].


A natural question, then, is whether there is a theory that can separate the two 
notions of politeness. The following subsection provides an affirmative answer. 


3.2 A Polite Theory that is not Strongly Polite 


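Let $\Sigma_2$ be a signature with two sorts $\sigma_1$ and $\sigma_2$ and no
function or predicate symbols (except $=$). Let $\mathcal{T}_{2,3}$ be the
$\Sigma_2$-theory from [8], consisting of all $\Sigma_2$-structures $\mathcal{A}$
such that either $|\sigma_1^{\mathcal{A}}| = 2 \wedge |\sigma_2^{\mathcal{A}}| \geq \aleph_0$
or $|\sigma_1^{\mathcal{A}}| \geq 3 \wedge |\sigma_2^{\mathcal{A}}| \geq 3$ [8].⁶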

$\mathcal{T}_{2,3}$ is polite, but is not strongly polite. Its smoothness is shown by extending
any given structure with new elements as much as necessary.


Lemma 1. $\mathcal{T}_{2,3}$ is smooth w.r.t. $\{\sigma_1, \sigma_2\}$.


For finite witnessability, consider the function wit defined as follows:

$$\mathit{wit}(\varphi) = \varphi \wedge x_1 = x_1 \wedge x_2 = x_2 \wedge x_3 = x_3 \wedge y_1 = y_1 \wedge y_2 = y_2 \wedge y_3 = y_3 \qquad (1)$$

for fresh variables $x_1$, $x_2$, and $x_3$ of sort $\sigma_1$ and $y_1$, $y_2$,
and $y_3$ of sort $\sigma_2$. It can be shown that wit is a witness for
$\mathcal{T}_{2,3}$ but there is no strong witness for it.


Lemma 2. $\mathcal{T}_{2,3}$ is finitely witnessable w.r.t. $\{\sigma_1, \sigma_2\}$.


Lemma 3. $\mathcal{T}_{2,3}$ is not strongly finitely witnessable w.r.t. $\{\sigma_1, \sigma_2\}$.


Lemmas 1 to 3 show that $\mathcal{T}_{2,3}$ is polite but is not strongly polite.
And indeed, using the polite combination method from [12] with this theory can
cause problems. Consider the theory $\mathcal{T}_{1,1}$ that consists of all
$\Sigma_2$-structures $\mathcal{A}$ such that
$|\sigma_1^{\mathcal{A}}| = |\sigma_2^{\mathcal{A}}| = 1$. Clearly,
$\mathcal{T}_{1,1} \oplus \mathcal{T}_{2,3}$ is empty, and hence no formula is
$\mathcal{T}_{1,1} \oplus \mathcal{T}_{2,3}$-satisfiable. However, denote the
formula true by $\Gamma_1$ and the formula $x = x$ by $\Gamma_2$ for some
variable $x$ of sort $\sigma_1$. Then $\mathit{wit}(\Gamma_2)$ is
$x = x \wedge \bigwedge_{i=1}^{3} (x_i = x_i \wedge y_i = y_i)$.
Let $\delta$ be the arrangement $x = x_1 = x_2 = x_3 \wedge y_1 = y_2 = y_3$. It
can be shown that $\mathit{wit}(\Gamma_2) \wedge \delta$ is
$\mathcal{T}_{2,3}$-satisfiable and $\Gamma_1 \wedge \delta$ is
$\mathcal{T}_{1,1}$-satisfiable. Hence the combination method of [12] would
consider $\Gamma_1 \wedge \Gamma_2$ to be
$\mathcal{T}_{1,1} \oplus \mathcal{T}_{2,3}$-satisfiable, which is
impossible. Hence the fact that $\mathcal{T}_{2,3}$ is not strongly polite
propagates all the way to the polite combination method.⁷
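
The cardinality reasoning behind this counterexample can be replayed mechanically. The sketch below rests on an assumption that is safe only because the signature is empty: a structure is represented by the pair of sizes of its two domains, with `math.inf` standing for any infinite size. The function names `in_T23` and `in_T11` are illustrative.

```python
import math

INF = math.inf  # stands for "infinite"; finite sizes are plain integers


def in_T23(c1, c2):
    """Membership in T_{2,3}, given the sizes of sigma_1 and sigma_2."""
    return (c1 == 2 and c2 == INF) or (c1 >= 3 and c2 >= 3)


def in_T11(c1, c2):
    """Membership in T_{1,1}: both domains are singletons."""
    return c1 == 1 and c2 == 1


# The arrangement delta equates all variables of each sort, so any structure
# with non-empty domains can satisfy it; Gamma_1 and wit(Gamma_2) add no
# further constraints because they contain only trivial equalities.
sizes = [1, 2, 3, 4, 5, INF]
shapes = [(c1, c2) for c1 in sizes for c2 in sizes]

t23_models = [s for s in shapes if in_T23(*s)]
t11_models = [s for s in shapes if in_T11(*s)]
common = [s for s in shapes if in_T23(*s) and in_T11(*s)]

# A finite witness for wit(Gamma_2) /\ delta would have exactly one element
# per sort, i.e. shape (1, 1), which is not a T_{2,3}-structure.
has_finite_witness = in_T23(1, 1)
```

Each side is satisfiable on its own, yet no shape belongs to both theories, and the only candidate finite witness, shape (1, 1), lies outside $\mathcal{T}_{2,3}$. This is the point at which the missing strong finite witnessability shows up.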


6 In [8], the first condition is written $|\sigma_1^{\mathcal{A}}| \geq 2$. We use equality as this is equivalent
and we believe it makes things clearer.
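7 Notice that $\mathcal{T}_{2,3}$ can be axiomatized using the following set of axioms, given the
definitions in Figure 1: $\{\psi^{\sigma_1}_{\geq 2}, \psi^{\sigma_2}_{\geq 3}\} \cup \{\psi^{\sigma_1}_{=2} \rightarrow \neg\psi^{\sigma_2}_{=n} \mid n \geq 3\}$.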
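
$$\begin{aligned}
\mathit{distinct}(x_1, \ldots, x_n) &= \bigwedge_{1 \leq i < j \leq n} x_i \neq x_j \\
\psi^{\sigma}_{\geq n} &= \exists x_1, \ldots, x_n.\ \mathit{distinct}(x_1, \ldots, x_n) \\
\psi^{\sigma}_{\leq n} &= \exists x_1, \ldots, x_n.\ \forall y.\ \bigvee_{i=1}^{n} y = x_i \\
\psi^{\sigma}_{= n} &= \psi^{\sigma}_{\geq n} \wedge \psi^{\sigma}_{\leq n}
\end{aligned}$$

Fig. 1. Cardinality formulas for sort $\sigma$. All variables are assumed to have sort $\sigma$.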
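
As an illustration of how the cardinality formulas of Figure 1 could be handed to a solver, the following sketch prints them in SMT-LIB 2 concrete syntax. The generator functions and the default sort name `S` are assumptions made for this example; the special cases for n = 1 exist because SMT-LIB's `distinct` and `or` expect at least two arguments.

```python
def at_least(n, sort="S"):
    """psi_{>=n}: there exist n pairwise distinct elements of `sort` (n >= 1)."""
    xs = [f"x{i}" for i in range(1, n + 1)]
    binders = " ".join(f"({x} {sort})" for x in xs)
    body = f"(distinct {' '.join(xs)})" if n >= 2 else "true"
    return f"(exists ({binders}) {body})"


def at_most(n, sort="S"):
    """psi_{<=n}: some n elements cover every element of `sort` (n >= 1)."""
    xs = [f"x{i}" for i in range(1, n + 1)]
    binders = " ".join(f"({x} {sort})" for x in xs)
    eqs = [f"(= y {x})" for x in xs]
    disjunction = f"(or {' '.join(eqs)})" if n >= 2 else eqs[0]
    return f"(exists ({binders}) (forall ((y {sort})) {disjunction}))"


def exactly(n, sort="S"):
    """psi_{=n}: the conjunction of the two bounds above."""
    return f"(and {at_least(n, sort)} {at_most(n, sort)})"
```

For instance, `at_least(2)` produces `(exists ((x1 S) (x2 S)) (distinct x1 x2))`, which can be asserted after declaring the sort with `(declare-sort S 0)`.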


Remark 1. An alternative way to separate politeness from strong politeness using
$\mathcal{T}_{2,3}$ can be obtained through shiny theories, as follows. Shiny theories were
introduced in [21] for the mono-sorted case, and were generalized to many-sorted
signatures in two different ways in [8] and [17]. In [8], $\mathcal{T}_{2,3}$ was introduced as a
theory that is shiny according to [17], but not according to [8]. Theorem 1 of [8]
states that their notion of shininess is equivalent to strong politeness for theories
in which the satisfiability problem for quantifier-free formulas is decidable. Since
this is the case for $\mathcal{T}_{2,3}$, and since it is not shiny according to [8], we get that
$\mathcal{T}_{2,3}$ is not strongly polite. Further, Proposition 18 of [17] states that every shiny
theory (according to their definition) is polite. Hence we get that $\mathcal{T}_{2,3}$ is polite
but not strongly polite.

We have (and prefer) a direct proof based only on politeness, without a detour 
through shininess. Note also that [8] dealt only with strongly polite theories and 
did not study the weaker notion of polite theories. In particular, the fact that 
strong politeness is different from politeness was neither stated nor proved there.


3.3 The Case of Mono-sorted Polite Theories 


Theory $\mathcal{T}_{2,3}$ includes two sorts but is otherwise empty. In this section, we show
that requiring two sorts is essential for separating politeness from strong
politeness in otherwise empty signatures. That is, we prove that politeness implies
strong politeness otherwise. Let $\Sigma_0$ be the signature with a single sort $\sigma$ and
no function or predicate symbols (except $=$). We show that smooth $\Sigma_0$-theories
have a certain form and conclude strong politeness from politeness.


Lemma 4. Let $\mathcal{T}$ be a $\Sigma_0$-theory. If $\mathcal{T}$ is smooth w.r.t. $\sigma$ and includes a finite
structure, then $\mathcal{T}$ is axiomatized by $\psi^{\sigma}_{\geq n}$ from Figure 1 for some $n > 0$.


Proposition 1. If $\mathcal{T}$ is a $\Sigma_0$-theory that is polite w.r.t. $\sigma$, then it is strongly
polite w.r.t. $\sigma$.


Remark 2. We again note (as we did in Remark 1) that an alternative way
to obtain this result is via shiny theories, using [17], which introduced polite
theories, as well as [7], which compared strongly polite theories to shiny theories
in the mono-sorted case. Specifically, in the presence of a single sort, Proposition
19 of [17] states that:

(*) if the question of whether a polite theory over a finite signature
contains a finite structure is decidable, the theory is shiny.

In turn, Proposition 1 of [7] states that:

(**) every shiny theory over a mono-sorted signature with a decidable
satisfiability problem for quantifier-free formulas is also strongly polite.

It can be shown that the question of whether a polite $\Sigma_0$-theory contains a finite
structure is decidable. It can also be shown that satisfiability of quantifier-free
formulas is decidable for such theories. Using (*) and (**), we get that for
$\Sigma_0$-theories, politeness implies strong politeness. As above (Remark 1), we prefer a
direct route for showing this result, without going through shiny theories.


3.4 Mono-sorted Finite Witnessability 


We have seen that for $\Sigma_0$-theories, politeness and strong politeness are the same.
Now we show that smoothness is crucial for this equivalence, i.e., that there is
no such equivalence between finite witnessability and strong finite witnessability.
Let $\mathcal{T}^{\infty}_{\mathit{Even}}$ be the $\Sigma_0$-theory of all $\Sigma_0$-structures $\mathcal{A}$ such that $|\sigma^{\mathcal{A}}|$ is even or
infinite. Clearly, this theory is not smooth.


Lemma 5. $\mathcal{T}^{\infty}_{\mathit{Even}}$ is not smooth w.r.t. $\sigma$.


We can construct a witness wit for $\mathcal{T}^{\infty}_{\mathit{Even}}$ as follows. Let $\varphi$ be a quantifier-free
$\Sigma_0$-formula, and let $E$ be the set of all equivalence relations over $\mathit{vars}(\varphi) \cup \{w\}$
for some fresh variable $w$. Let $\mathit{even}(E)$ be the set of all equivalence relations in
$E$ with an even number of equivalence classes. Then, $\mathit{wit}(\varphi)$ is $\varphi \wedge \bigvee_{e \in \mathit{even}(E)} \delta_e$,
where for each $e \in \mathit{even}(E)$, $\delta_e$ is the arrangement induced by $e$:

$$\bigwedge_{(x,y) \in e} x = y \;\wedge\; \bigwedge_{x,y \in \mathit{vars}(\varphi) \cup \{w\},\ (x,y) \notin e} x \neq y$$

It can be shown that wit is indeed a witness, and that $\mathcal{T}^{\infty}_{\mathit{Even}}$ has no strong
witness, with a proof similar to that of Lemma 3.
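
The disjuncts of this witness can be enumerated directly. The sketch below is a minimal illustration: it encodes each equivalence relation over the variables of the formula plus the fresh variable as a restricted-growth string (a standard encoding of set partitions, chosen here for brevity and not taken from the paper) and keeps those with an even number of classes. The names `block_labelings` and `even_arrangements` are hypothetical.

```python
def block_labelings(n):
    """Yield restricted-growth strings of length n, one per partition of n items.

    Position i holds the index of the block containing item i; a new block
    may only receive the next unused index, so each partition appears once.
    """
    def go(prefix, used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(used + 1):
            yield from go(prefix + [label], max(used, label + 1))

    yield from go([], 0)


def even_arrangements(variables, fresh="w"):
    """Equivalence relations over variables + [fresh] with an even class count.

    Each result maps a variable name to the index of its equivalence class.
    """
    names = list(variables) + [fresh]
    result = []
    for labels in block_labelings(len(names)):
        if (max(labels) + 1) % 2 == 0:
            result.append(dict(zip(names, labels)))
    return result
```

When the formula has at least one variable, every equivalence relation over its variables extends to one with an even number of classes: the fresh variable either joins an existing class or forms a new one, which is the role it plays in the construction.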


Lemma 6. $\mathcal{T}^{\infty}_{\mathit{Even}}$ is finitely witnessable w.r.t. $\sigma$.


Lemma 7. $\mathcal{T}^{\infty}_{\mathit{Even}}$ is not strongly finitely witnessable w.r.t. $\sigma$.


4 A Blend of Polite and Stably-Infinite Theories 


In this section, we show that the polite combination method can be optimized 
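to reduce the search space of possible arrangements. In what follows, $\Sigma_1$ and $\Sigma_2$
are disjoint signatures, $S = S_{\Sigma_1} \cap S_{\Sigma_2}$, $\mathcal{T}_1$ is a $\Sigma_1$-theory, $\mathcal{T}_2$ is a $\Sigma_2$-theory, $\Gamma_1$
is a set of $\Sigma_1$-literals, and $\Gamma_2$ is a set of $\Sigma_2$-literals.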


8 Notice that $\mathcal{T}^{\infty}_{\mathit{Even}}$ can be axiomatized using the set $\{\neg\psi^{\sigma}_{=2n+1} \mid n \in \mathbb{N}\}$.

The Nelson-Oppen procedure reduces the $\mathcal{T}_1 \oplus \mathcal{T}_2$-satisfiability of $\Gamma_1 \cup \Gamma_2$
to the existence of an arrangement $\delta$ over the set $V = \mathit{vars}_S(\Gamma_1) \cap \mathit{vars}_S(\Gamma_2)$,
such that $\Gamma_1 \cup \delta$ is $\mathcal{T}_1$-satisfiable and $\Gamma_2 \cup \delta$ is $\mathcal{T}_2$-satisfiable. The correctness of
this reduction relies on the fact that both theories are stably infinite w.r.t. $S$. In
contrast, the polite combination method only requires a condition (namely strong
politeness) from one of the theories, while the other theory is unrestricted and,
in particular, not necessarily stably infinite. In polite combination, the
$\mathcal{T}_1 \oplus \mathcal{T}_2$-satisfiability of $\Gamma_1 \cup \Gamma_2$ is again reduced to the existence of an arrangement $\delta$,
but over a different set $V' = \mathit{vars}_S(\mathit{wit}(\Gamma_2))$, such that $\Gamma_1 \cup \delta$ is $\mathcal{T}_1$-satisfiable
and $\mathit{wit}(\Gamma_2) \cup \delta$ is $\mathcal{T}_2$-satisfiable, where wit is a strong witness for $\mathcal{T}_2$ w.r.t. $S$.
Thus, the flexibility offered by polite combination comes with a price. The set
$V'$ is potentially larger than $V$ as it contains all variables with sorts in $S$ that
occur in $\mathit{wit}(\Gamma_2)$, not just those that also occur in $\Gamma_1$. Since the search space
of arrangements over a set grows exponentially with its size, this difference can
become crucial. If $\mathcal{T}_1$ happens to be stably infinite w.r.t. $S$, however, we can fall
back to Nelson-Oppen combination and only consider variables that are shared
by the two sets. But what if $\mathcal{T}_1$ is stably infinite only w.r.t. some proper subset
$S' \subset S$? Can this knowledge about $\mathcal{T}_1$ help in finding some set $V''$ of variables
between $V$ and $V'$, such that we need only consider arrangements of $V''$? In this
section we prove that this is possible by taking $V''$ to include only the variables
of sorts in $S'$ that are shared between $\Gamma_1$ and $\mathit{wit}(\Gamma_2)$, and all the variables of
sorts in $S \setminus S'$ that occur in $\mathit{wit}(\Gamma_2)$. We also identify several weaker conditions
on $\mathcal{T}_2$ that are sufficient for the combination theorem to hold.


4.1 Refined Combination Theorem


To put the discussion above in formal terms, we recall the following theorem.


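Theorem 3 ([12]). If $\mathcal{T}_2$ is strongly polite w.r.t. $S$ with a witness wit, then
the following are equivalent: 1. $\Gamma_1 \cup \Gamma_2$ is $(\mathcal{T}_1 \oplus \mathcal{T}_2)$-satisfiable; 2. there exists an
arrangement $\delta_V$ over $V$, such that $\Gamma_1 \cup \delta_V$ is $\mathcal{T}_1$-satisfiable and $\mathit{wit}(\Gamma_2) \cup \delta_V$ is
$\mathcal{T}_2$-satisfiable, where $V = \bigcup_{\sigma \in S} V_\sigma$, and $V_\sigma = \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for each $\sigma \in S$.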


Our goal is to identify general cases in which information regarding $\mathcal{T}_1$ can
help reduce the size of the set V. We extend the definitions of stably infinite, 
smooth, and strongly finitely witnessable to two sets of sorts rather than one. 
Roughly speaking, in this extension, the usual definition is taken for the first 
set, and some cardinality-preserving constraints are enforced on the second set. 


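Definition 4. Let $\Sigma$ be a signature, $S_1$, $S_2$ two disjoint subsets of $S_\Sigma$, and $\mathcal{T}$ a
$\Sigma$-theory.

$\mathcal{T}$ is (strongly) stably infinite w.r.t. $(S_1, S_2)$ if for every quantifier-free
$\Sigma$-formula $\varphi$ and $\mathcal{T}$-interpretation $\mathcal{A}$ satisfying $\varphi$, there exists a $\mathcal{T}$-interpretation $\mathcal{B}$
such that $\mathcal{B} \models \varphi$, $|\sigma^{\mathcal{B}}|$ is infinite for every $\sigma \in S_1$, and $|\sigma^{\mathcal{B}}| \leq |\sigma^{\mathcal{A}}|$ ($|\sigma^{\mathcal{B}}| = |\sigma^{\mathcal{A}}|$)
for every $\sigma \in S_2$.

$\mathcal{T}$ is smooth w.r.t. $(S_1, S_2)$ if for every quantifier-free $\Sigma$-formula $\varphi$,
$\mathcal{T}$-interpretation $\mathcal{A}$ satisfying $\varphi$, and function $\kappa$ from $S_1$ to the class of cardinals
such that $\kappa(\sigma) \geq |\sigma^{\mathcal{A}}|$ for each $\sigma \in S_1$, there exists a $\mathcal{T}$-interpretation $\mathcal{B}$ that
satisfies $\varphi$, with $|\sigma^{\mathcal{B}}| = \kappa(\sigma)$ for each $\sigma \in S_1$, and with $|\sigma^{\mathcal{B}}|$ infinite whenever
$|\sigma^{\mathcal{A}}|$ is infinite for each $\sigma \in S_2$.

$\mathcal{T}$ is strongly finitely witnessable w.r.t. $(S_1, S_2)$ if there exists a computable
function $\mathit{wit} : QF(\Sigma) \to QF(\Sigma)$ such that for every quantifier-free $\Sigma$-formula
$\varphi$: 1. $\varphi$ and $\exists \vec{w}.\, \mathit{wit}(\varphi)$ are $\mathcal{T}$-equivalent for $\vec{w} = \mathit{vars}(\mathit{wit}(\varphi)) \setminus \mathit{vars}(\varphi)$; and
2. for every $\mathcal{T}$-interpretation $\mathcal{A}$ and arrangement $\delta$ of any set of variables whose
sorts are in $S_1$, if $\mathcal{A}$ satisfies $\mathit{wit}(\varphi) \wedge \delta$, then there exists a $\mathcal{T}$-interpretation $\mathcal{B}$
that finitely witnesses $\mathit{wit}(\varphi) \wedge \delta$ w.r.t. $S_1$ and for which $|\sigma^{\mathcal{B}}|$ is infinite whenever
$|\sigma^{\mathcal{A}}|$ is infinite, for each $\sigma \in S_2$.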


Our main result is the following. 


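Theorem 4. Let $S^{si} \subseteq S$ and $S^{nsi} = S \setminus S^{si}$. Suppose $T_1$ is stably infinite
w.r.t. $S^{si}$ and one of the following holds:

1. $T_2$ is strongly stably infinite w.r.t. $(S^{si}, S^{nsi})$ and strongly polite w.r.t. $S^{nsi}$
with a witness $\mathit{wit}$.

2. $T_2$ is stably infinite w.r.t. $(S^{si}, S^{nsi})$, smooth w.r.t. $(S^{nsi}, S^{si})$, and strongly
finitely witnessable w.r.t. $S^{nsi}$ with a witness $\mathit{wit}$.

3. $T_2$ is stably infinite w.r.t. $S^{si}$ while smooth and strongly finitely witnessable
w.r.t. $(S^{nsi}, S^{si})$ with a witness $\mathit{wit}$.

Then the following are equivalent: 1. $\Gamma_1 \cup \Gamma_2$ is $(T_1 \oplus T_2)$-satisfiable; 2. There exists
an arrangement $\delta_V$ over $V$ such that $\Gamma_1 \cup \delta_V$ is $T_1$-satisfiable, and $\mathit{wit}(\Gamma_2) \cup \delta_V$ is
$T_2$-satisfiable, where $V = \bigcup_{\sigma \in S} V_\sigma$, with $V_\sigma = \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for every $\sigma \in S^{nsi}$
and $V_\sigma = \mathit{vars}_\sigma(\Gamma_1) \cap \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for every $\sigma \in S^{si}$.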
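
To make the difference between the variable sets of Theorem 3 and Theorem 4 concrete, the following minimal Python sketch computes both. It is an illustration only: the function names, the dictionary representation of variables grouped by sort, and the toy sorts `elem` and `idx` are assumptions introduced here, not part of the paper or of any solver.

```python
from typing import Dict, Set

# Variables of a set of literals, grouped by (hypothetical) sort name.
VarsBySort = Dict[str, Set[str]]


def polite_vars(wit_gamma2: VarsBySort, sorts: Set[str]) -> Set[str]:
    """Theorem 3: all variables of wit(Gamma_2) whose sort is in S."""
    result: Set[str] = set()
    for sort in sorts:
        result |= wit_gamma2.get(sort, set())
    return result


def refined_vars(gamma1: VarsBySort, wit_gamma2: VarsBySort,
                 s_si: Set[str], s_nsi: Set[str]) -> Set[str]:
    """Theorem 4: for sorts in S^nsi take all variables of wit(Gamma_2);
    for sorts in S^si take only those shared with Gamma_1."""
    result: Set[str] = set()
    for sort in s_nsi:
        result |= wit_gamma2.get(sort, set())
    for sort in s_si:
        result |= gamma1.get(sort, set()) & wit_gamma2.get(sort, set())
    return result


# Toy input (assumed for illustration): T_1 is stably infinite w.r.t. "elem" only.
gamma1 = {"elem": {"e1"}, "idx": {"i1", "i2"}}
wit_gamma2 = {"elem": {"e1", "e2", "e3"}, "idx": {"i1", "i2"}}

v_polite = polite_vars(wit_gamma2, {"elem", "idx"})
v_refined = refined_vars(gamma1, wit_gamma2, s_si={"elem"}, s_nsi={"idx"})
print(sorted(v_polite))
print(sorted(v_refined))
```

In this toy input the variables `e2` and `e3` occur only in the witnessed formula and have a sort in $S^{si}$, so they drop out of the refined set.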


All three items of Theorem 4 include assumptions that guarantee that the two
theories agree on cardinalities of shared sorts. For example, in the first item, we
first shrink the $S^{nsi}$-domains of the $T_2$-model using strong finite witnessability,
and then expand them using smoothness. But then, to obtain infinite domains
for the $S^{si}$ sorts, stable infiniteness is not enough, as we need to maintain the
cardinalities of the $S^{nsi}$ domains while making the domains of the $S^{si}$ sorts
infinite. For this, the stronger property of strong stable infiniteness is used.

The formal proof of this theorem is provided in Section 4.2, below. Figure 2 
is a visualization of the claims in Theorem 4. The theorem considers two variants 
of strong finite witnessability, two variants of smoothness, and three variants of 
stable infiniteness. For each of the three cases of Theorem 4, Figure 2 shows 
which variant of each property is assumed. The height of each bar corresponds 
to the strength of the property. In the first case, we use ordinary strong finite 
witnessability and smoothness, but the strongest variant of stable infiniteness; 
in the second, we use ordinary strong finite witnessability with the new variants 
of stable infiniteness and smoothness; and for the third, we use ordinary stable 
infiniteness and the stronger variants of strong finite witnessability and
smoothness. The order of the bars corresponds to the order of their usage in the proof of
each case. The stage at which stable infiniteness is used determines the required
strength of the other properties: whatever is used before is taken in ordinary
form, and whatever is used after requires a stronger form.

[Figure 2: bar chart with the strength levels regular, medium, and strong on the vertical axis; one group of bars for each of Case 1, Case 2, and Case 3; legend: strong finite witnessability, smoothness, stable infiniteness.]

Fig. 2. Theorem 4. The height of each bar corresponds to the strength of the
property. The bars are ordered according to their usage in the proof.

Going back to the standard definitions of stable infiniteness, smoothness, and
strong finite witnessability, we get the following corollary by using case 1 of the
theorem and noticing that smoothness w.r.t. $S$ implies strong stable infiniteness
w.r.t. any partition of $S$.


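Corollary 1. Let $S^{si} \subseteq S$ and $S^{nsi} = S \setminus S^{si}$. Suppose $T_1$ is stably infinite
w.r.t. $S^{si}$ and $T_2$ is strongly finitely witnessable w.r.t. $S^{nsi}$ with witness $\mathit{wit}$ and
smooth w.r.t. $S$. Then, the following are equivalent:

1. $\Gamma_1 \cup \Gamma_2$ is $(T_1 \oplus T_2)$-satisfiable; 2. there exists an arrangement $\delta_V$ over
$V$ such that $\Gamma_1 \cup \delta_V$ is $T_1$-satisfiable and $\mathit{wit}(\Gamma_2) \cup \delta_V$ is $T_2$-satisfiable, where
$V = \bigcup_{\sigma \in S} V_\sigma$, with $V_\sigma = \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for $\sigma \in S^{nsi}$ and
$V_\sigma = \mathit{vars}_\sigma(\Gamma_1) \cap \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for $\sigma \in S^{si}$.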


Finally, the following result, which is closest to Theorem 3, is directly obtained
from Corollary 1, since the strong politeness of $T_2$ w.r.t. $S^{si} \cup S^{nsi}$ implies
that it is strongly finitely witnessable w.r.t. $S^{nsi}$ and smooth w.r.t. $S^{si} \cup S^{nsi}$.


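Corollary 2. Let $S^{si} \subseteq S$ and $S^{nsi} = S \setminus S^{si}$. If $T_1$ is stably infinite w.r.t. $S^{si}$
and $T_2$ is strongly polite w.r.t. $S$ with a witness $\mathit{wit}$, then the following are
equivalent: 1. $\Gamma_1 \cup \Gamma_2$ is $(T_1 \oplus T_2)$-satisfiable; 2. there exists an arrangement
$\delta_V$ over $V$ such that $\Gamma_1 \cup \delta_V$ is $T_1$-satisfiable and $\mathit{wit}(\Gamma_2) \cup \delta_V$ is $T_2$-satisfiable,
where $V = \bigcup_{\sigma \in S} V_\sigma$, with $V_\sigma = \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for each $\sigma \in S^{nsi}$ and
$V_\sigma = \mathit{vars}_\sigma(\Gamma_1) \cap \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for each $\sigma \in S^{si}$.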


Compared to Theorem 3, Corollary 2 partitions $S$ into $S^{si}$ and $S^{nsi}$ and
requires that $T_1$ be stably infinite w.r.t. $S^{si}$. The gain from this requirement is
that the set $V_\sigma$ is potentially reduced for $\sigma \in S^{si}$. Note that unlike Theorem 4
and Corollary 1, Corollary 2 has the same assumptions regarding $T_2$ as the
original Theorem 3 from [12]. We show its potential impact in the next example.


Example 7. Consider the theory $T_{\mathrm{ListIntBV4}}$ from Example 3. Let $\Gamma_1$ be
$x = 5 \wedge v = 0000 \wedge w = w \,\&\, v$, and let $\Gamma_2$ be
$a_0 = \mathit{cons}(x, v, a_1) \wedge \bigwedge_{i=1}^{n} a_i = \mathit{cons}(y_i, w, a_{i+1})$.
Using the witness function $\mathit{wit}$ from [18], $\mathit{wit}(\Gamma_2) = \Gamma_2$. The
polite combination approach reduces the $T_{\mathrm{ListIntBV4}}$-satisfiability of $\Gamma_1 \wedge \Gamma_2$ to
the existence of an arrangement $\delta$ over $\{x, v, w\} \cup \{y_1, \ldots, y_n\}$, such that $\Gamma_1 \wedge \delta$ is
$T_{\mathrm{IntBV4}}$-satisfiable and $\mathit{wit}(\Gamma_2) \wedge \delta$ is $T_{\mathrm{List}}$-satisfiable. Corollary 2 shows that we
can do better. Since $T_{\mathrm{IntBV4}}$ is stably infinite w.r.t. $\{\mathrm{int}\}$, it is enough to check
the existence of an arrangement over the variables of sort BV4 that occur in
$\mathit{wit}(\Gamma_2)$, together with the variables of sort int that are shared between $\Gamma_1$ and
$\Gamma_2$. This means that arrangements over $\{x, v, w\}$ are considered, instead of over
$\{x, v, w\} \cup \{y_1, \ldots, y_n\}$. As $n$ becomes large, standard polite combination requires
considering exponentially more arrangements, while the number of arrangements
considered by our combination method remains the same.
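
The growth described in Example 7 can be made concrete with a short Python sketch. It assumes that an arrangement over a set of variables corresponds to a sort-respecting partition of that set, so that the number of arrangements is the product, over the sorts, of the Bell number of the variable count for that sort. The function names are hypothetical and the count is only a rough measure of the search space, not a description of any solver.

```python
from math import comb
from typing import Dict


def bell(n: int) -> int:
    """Number of partitions of an n-element set (Bell number B_n),
    computed with the recurrence B_{m+1} = sum_k C(m, k) * B_k."""
    b = [1]
    for m in range(n):
        b.append(sum(comb(m, k) * b[k] for k in range(m + 1)))
    return b[n]


def count_arrangements(vars_per_sort: Dict[str, int]) -> int:
    """Arrangements never equate variables of different sorts,
    so the counts for the individual sorts multiply."""
    total = 1
    for count in vars_per_sort.values():
        total *= bell(count)
    return total


def polite_count(n: int) -> int:
    # Variables x, y_1, ..., y_n have sort int; v and w have sort BV4.
    return count_arrangements({"int": n + 1, "BV4": 2})


def refined_count() -> int:
    # Only x remains among the int variables; v and w are kept.
    return count_arrangements({"int": 1, "BV4": 2})


for n in (1, 3, 5):
    print(n, polite_count(n), refined_count())
```

Under these assumptions the refined count does not depend on $n$, whereas the polite count grows with the Bell numbers of $n + 1$.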


4.2 Proof of Theorem 4 


The left-to-right direction is straightforward, using the reducts of the satisfying
interpretation of $\Gamma_1 \cup \Gamma_2$ to $\Sigma_1$ and $\Sigma_2$. We now focus on the right-to-left
direction, and begin with the following lemma, which strengthens Theorem 1,
obtaining a many-sorted Löwenheim-Skolem Theorem, where the cardinality of
the finite sorts remains the same.


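Lemma 8. Let $\Sigma$ be a signature, $T$ a $\Sigma$-theory, $\phi$ a $\Sigma$-formula, and $A$ a
$T$-interpretation that satisfies $\phi$. Let $S_\Sigma = S_A^{fin} \uplus S_A^{inf}$, where $\sigma^A$ is finite for
every $\sigma \in S_A^{fin}$ and $\sigma^A$ is infinite for every $\sigma \in S_A^{inf}$. Then there exists a
$T$-interpretation $B$ that satisfies $\phi$ such that $|\sigma^B| = |\sigma^A|$ for every $\sigma \in S_A^{fin}$ and
$\sigma^B$ is countable for every $\sigma \in S_A^{inf}$.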

The proof of Theorem 4 continues with the following main lemma. 


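Lemma 9 (Main Lemma). Let $S^{si} \subseteq S$ and $S^{nsi} = S \setminus S^{si}$. Suppose $T_1$
is stably infinite w.r.t. $S^{si}$ and that one of the three cases of Theorem 4 holds.
Further, assume there exists an arrangement $\delta_V$ over $V$ such that $\Gamma_1 \cup \delta_V$ is
$T_1$-satisfiable, and $\mathit{wit}(\Gamma_2) \cup \delta_V$ is $T_2$-satisfiable, where $V = \bigcup_{\sigma \in S} V_\sigma$, with
$V_\sigma = \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for each $\sigma \in S^{nsi}$ and $V_\sigma = \mathit{vars}_\sigma(\Gamma_1) \cap \mathit{vars}_\sigma(\mathit{wit}(\Gamma_2))$ for each
$\sigma \in S^{si}$. Then, there is a $T_1$-interpretation $A$ that satisfies $\Gamma_1 \cup \delta_V$ and a
$T_2$-interpretation $B$ that satisfies $\mathit{wit}(\Gamma_2) \cup \delta_V$ such that $|\sigma^A| = |\sigma^B|$ for all $\sigma \in S$.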


Proof: Let $\psi_2 := \mathit{wit}(\Gamma_2)$. Since $T_1$ is stably infinite w.r.t. $S^{si}$, there is a
$T_1$-interpretation $A$ satisfying $\Gamma_1 \cup \delta_V$ in which $\sigma^A$ is infinite for each $\sigma \in S^{si}$.
By Theorem 1, we may assume that $\sigma^A$ is countable for each $\sigma \in S^{si}$. We
consider the first case of Theorem 4 (the others are omitted due to space
constraints). Suppose $T_2$ is strongly stably infinite w.r.t. $(S^{si}, S^{nsi})$ and strongly
polite w.r.t. $S^{nsi}$. Since $T_2$ is strongly finitely witnessable w.r.t. $S^{nsi}$, there
exists a $T_2$-interpretation $B$ that satisfies $\psi_2 \cup \delta_V$ such that $\sigma^B = V_\sigma^B$ for
each $\sigma \in S^{nsi}$. Since $A$ and $B$ satisfy $\delta_V$, we have that for every $\sigma \in S^{nsi}$,
$|\sigma^B| = |V_\sigma^B| = |V_\sigma^A| \le |\sigma^A|$. $T_2$ is also smooth w.r.t. $S^{nsi}$, and so there
exists a $T_2$-interpretation $B'$ satisfying $\psi_2 \cup \delta_V$ such that $|\sigma^{B'}| = |\sigma^A|$ for each
$\sigma \in S^{nsi}$. Finally, $T_2$ is strongly stably infinite w.r.t. $(S^{si}, S^{nsi})$, so there is a
$T_2$-interpretation $B''$ that satisfies $\psi_2 \cup \delta_V$ such that $\sigma^{B''}$ is infinite for each $\sigma \in S^{si}$
and $|\sigma^{B''}| = |\sigma^{B'}| = |\sigma^A|$ for each $\sigma \in S^{nsi}$. By Lemma 8, we may assume that
$\sigma^{B''}$ is countable for each $\sigma \in S^{si}$. Thus, $|\sigma^{B''}| = |\sigma^A|$ for each $\sigma \in S$.


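We now conclude Theorem 4: Let $T := T_1 \oplus T_2$. Lemma 9 gives us a
$T_1$-interpretation $A$ with $A \models \Gamma_1 \cup \delta_V$ and a $T_2$-interpretation $B$ with $B \models \psi_2 \cup \delta_V$,
and $|\sigma^A| = |\sigma^B|$ for $\sigma \in S$. Set $\Gamma_1' := \Gamma_1 \cup \delta_V$ and $\Gamma_2' := \psi_2 \cup \delta_V$. Then,
$V_\sigma = \mathit{vars}_\sigma(\Gamma_1') \cap \mathit{vars}_\sigma(\Gamma_2')$ for $\sigma \in S$. Now, $A \models \Gamma_1' \cup \delta_V$ and $B \models \Gamma_2' \cup \delta_V$. Also,
$|\sigma^A| = |\sigma^B|$ for $\sigma \in S$. By Theorem 2, $\Gamma_1' \cup \Gamma_2'$ is $T$-satisfiable. In particular, $\Gamma_1 \cup \{\psi_2\}$ is
$T$-satisfiable, and hence also $\Gamma_1 \cup \{\exists \vec{w}.\, \psi_2\}$, with $\vec{w} = \mathit{vars}(\mathit{wit}(\Gamma_2)) \setminus \mathit{vars}(\Gamma_2)$.
Finally, $\exists \vec{w}.\, \mathit{wit}(\Gamma_2)$ is $T_2$-equivalent to $\Gamma_2$, hence $\Gamma_1 \cup \Gamma_2$ is $T$-satisfiable.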


5 Preliminary Case Study 


The results presented in Section 4 were motivated by a set of smart contract
verification benchmarks. We obtained these benchmarks by applying the
open-source Move Prover verifier [22] to smart contracts found in the open-source Diem
project [9]. The Move Prover is a formal verifier for smart contracts written in the
Move language [6] and was designed to target smart contracts used in the Diem 
blockchain [1]. It works via a translation to the Boogie verification framework 
[14], which in turn produces SMT-LIB 2 benchmarks that are dispatched to SMT 
solvers. The benchmarks we obtained involve datatypes, integers, Booleans, and 
quantifiers. Our case study began by running CVC4 [3] on the benchmarks. For 
most of the benchmarks that were solved by CVC4, theory combination took a 
small percentage of the overall runtime of the solver, accounting for 10% or less 
in all but 1 benchmark. However, solving that benchmark took 81 seconds, of 
which 20 seconds was dedicated to theory combination. 

We implemented an optimization to the datatype solver of CVC4 based on
Corollary 2. With the original polite combination method, every term that
originates from the theory of datatypes with another sort is shared with the other
theories, triggering an analysis of the arrangements of these terms. In our
optimization, we limit the sharing of such terms to those of Boolean sort. In the
language of Corollary 2, $T_1$ is the combined theory of Booleans, uninterpreted
functions, and integers, which is stably infinite w.r.t. the uninterpreted sorts
and integer sorts. $T_2$ is an instance of the theory of datatypes, which is strongly
polite w.r.t. its element sorts, which in this case are the sorts of $T_1$.

A comparison of an original and optimized run on the difficult benchmark 
is shown in Figure 3. As shown, the optimization reduces the total running
time by 57% (from 81.5 to 34.9 seconds), and the time spent on theory combination in particular by 83%.
To further isolate the effectiveness of our optimization, we report the number of 
terms that each theory solver considered. In CVC4, constraints are not flattened, 
so shared terms are processed instead of shared variables. Each theory solver
maintains its own data structure for tracking equality information. These data
structures contain terms belonging to the theory that either come from the
input assertions or are shared with another theory. A data structure is also
maintained that contains all shared terms belonging to any theory. The last 4
columns of Figure 3 count the number of times (in thousands) a term was added
to the equality data structure for the theory of datatypes (DT), integers (INT),
and uninterpreted functions and Booleans (UFB), as well as to the shared
term data structure (shared). With the optimization, the datatype solver keeps
more inferred assertions internally, which leads to an increase in the number of
additions of terms to its data structure. However, sharing fewer terms reduces
the number of terms in the data structures for the other theories. Moreover, while
the total number of terms considered remains roughly the same, the number of
shared terms decreases by about 23%. This suggests that although the workload on the
individual theory solvers is roughly similar, a decrease in the number of shared
terms in the optimized run results in a significant improvement in the overall
runtime. Although our evidence is only anecdotal at the moment, we believe this
benchmark is highly representative of the potential benefits of our optimization.

          | total (s) | comb (s) |  DT   |  INT  |  UFB  | shared
optimized |   34.9    |   3.4    | 236.1 | 212.1 | 78.4  | 125.8
original  |   81.5    |   20.3   | 116.0 | 281.0 | 123.9 | 163.5

Fig. 3. Runtimes (in seconds) and number of terms (in thousands) added to the
data structures of DT, INT, UFB, and the number of shared terms (shared).
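
As a sanity check, the relative changes can be recomputed from the rounded values reported in Figure 3. The short Python sketch below does only that; the dictionary layout and the helper `reduction_percent` are introduced here for illustration, and the results inherit the rounding of the figure.

```python
# Values copied from Figure 3 (seconds for total/comb, thousands of terms otherwise).
figure3 = {
    "optimized": {"total": 34.9, "comb": 3.4, "DT": 236.1,
                  "INT": 212.1, "UFB": 78.4, "shared": 125.8},
    "original": {"total": 81.5, "comb": 20.3, "DT": 116.0,
                 "INT": 281.0, "UFB": 123.9, "shared": 163.5},
}


def reduction_percent(column: str) -> float:
    """Relative decrease from the original run to the optimized run, in percent."""
    before = figure3["original"][column]
    after = figure3["optimized"][column]
    return 100.0 * (before - after) / before


def terms_considered(run: str) -> float:
    """Sum of the additions to the DT, INT and UFB data structures (thousands)."""
    row = figure3[run]
    return row["DT"] + row["INT"] + row["UFB"]


for column in ("total", "comb", "shared"):
    print(column, round(reduction_percent(column), 1))
print(round(terms_considered("optimized"), 1), round(terms_considered("original"), 1))
```

The per-solver sums differ by only a few thousand additions, which is consistent with the observation that the overall workload of the theory solvers stays roughly the same.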


6 Conclusion 


This paper makes two contributions. First, we separated politeness and strong 
politeness, which shows that sometimes, the (typically harder) task of finding 
a strong witness is not a waste of effort. Then, we provided an optimization to 
the polite combination method, which applies when one of the theories in the 
combination is stably infinite w.r.t. a subset of the sorts. 

We envision several directions for future work. First, the separation of
politeness from strong politeness demonstrates a need to identify sufficient criteria for
the equivalence of these notions, such as, for instance, the additivity criterion
introduced by Sheng et al. [18]. Second, polite combination might be optimized
by applying the witness function only to part of the purified input formula.
Finally, we plan to extend the initial implementation of this approach in CVC4
and evaluate its impact based on more benchmarks.

Abstract. Unlike other methods for theorem proving modulo with constrained
clauses [12,13], equational theorem proving modulo with constrained
clauses along with its simplification techniques has not been well
studied. We introduce a basic paramodulation calculus modulo equational
theories E satisfying certain properties of E and present a new
framework for equational theorem proving modulo E with constrained
clauses. We propose an inference rule called Generalized E-Parallel for
constrained clauses, which makes our inference system completely basic,
meaning that we do not need to allow any paramodulation in the
constraint part of a constrained clause for refutational completeness. We
present a saturation procedure for constrained clauses based on relative
reducibility and show that our inference system including our contraction
rules is refutationally complete.


1 Introduction 


Equations occur frequently in many areas of mathematics, logics, and com- 
puter science. Equational theorem proving [6,8,19,22] is, in general, con- 
cerned with proving mathematical or logical statements in first-order clause 
logic with equality. While resolution [24] has been successful for theorem
proving for first-order clause logic without equality, it has some limitations in dealing
with the equality predicate. For example, when dealing with the equality
predicate using resolution, one must add the congruence axioms explicitly for each
predicate and function symbol in order to express the properties of
equality [8,22].

Paramodulation [23] is based on the replacement of equals by equals, in
order to improve the efficiency of resolution in equational theorem proving.
However, paramodulation, in general, often produces a large number of unnecessary
clauses, so the search space for a refutation expands very rapidly. Therefore,
various improvements have been developed for paramodulation. For example, it was
shown that the functional reflexivity equations used by the traditional
paramodulation rule [23] are not needed, and paramodulation into variables does not need
to be allowed (see [8]).

Basic paramodulation [9,20] restricts paramodulation by forbidding
paramodulation at (sub)terms introduced by substitutions from previous inference steps,
and uses orderings on terms and literals in order to further restrict
paramodulation inferences. In [21,26], basic paramodulation was extended to
basic paramodulation modulo associativity and commutativity (AC) axioms.


© The Author(s) 2021 
A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 166-182, 2021. 
(See [25] also for basic paramodulation modulo the associativity (A) axiom.)
Basic paramodulation modulo AC uses symbolic constraints, overcoming a
drawback of traditional paramodulation modulo AC (see [7,27]) that often
generates many slightly different permuted variants of clauses. For example, more than
a million conclusions can possibly be generated by paramodulating the equation
$x + x + x \approx x$ into the clause $P(y_1 + y_2 + y_3 + y_4)$ for which $+$ is an AC symbol,
since a minimal complete set of AC-unifiers for $x + x + x$ and $y_1 + y_2 + y_3 + y_4$
contains more than a million AC-unifiers [21,26]. On the other hand, one only
needs a single conclusion $P(x) \,\|\, x + x + x \approx^?_{AC} y_1 + y_2 + y_3 + y_4$ for the above
inference using basic paramodulation modulo AC with an equality constraint.

In this paper, we present a new basic paramodulation calculus modulo
equational theories E (including E = AC) parameterized by a suitable E-compatible
ordering $\succ$. Our main inference rule for basic paramodulation modulo E is given
(roughly) as follows:

$$\frac{C \vee s \approx t \,\|\, \phi_1 \qquad D \vee L[s'] \,\|\, \phi_2}{C \vee D \vee L[t] \,\|\, s \approx^?_E s' \wedge \phi_1 \wedge \phi_2}$$

The equality constraints are inherited and the accumulated E-unification
problems are kept in the constraint part of the conclusion. Instead of generating as
many conclusions as minimal and complete E-unifiers of two terms $s$ and $s'$,
a single conclusion is generated with its constraint keeping the E-unification
problem of $s$ and $s'$. Another key inference rule in our basic paramodulation
calculus modulo E is the Generalized E-Parallel (or E-Parallel) rule, adapted
from our recent work on basic narrowing modulo [18]. This rule allows our basic
paramodulation calculus to adapt the free case (i.e. $E = \emptyset$) to the modulo E
case (i.e. $E \neq \emptyset$).¹ For example, suppose that we have three clauses 1: $a + b \approx c$,
2: $a + (b + x) \approx c + x$, and 3: $(a + a) + (b + b) \not\approx c + c$, where $+$ is an AC symbol
with $+ \succ a \succ b \succ c$. We use the E-Parallel rule from clause 1 and 2 and obtain
the clause 4: $a + (b + (a + b)) \approx c + c$, which derives a contradiction with clause
3 because $a + (b + (a + b)) \approx_{AC} (a + a) + (b + b)$ (i.e. the equality constraint is
satisfiable). The details of this inference rule are discussed in Section 4.
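
The introductory example can be replayed with a small Python sketch. It is only an illustration under simplifying assumptions: terms are built from a single binary AC symbol `+`, constants, and variables, so two ground terms are AC-equivalent exactly when they have the same multiset of leaves. The tuple encoding and the helper names `substitute`, `leaves`, and `ac_equal` are hypothetical and do not come from the paper.

```python
from collections import Counter

# A term is a string (constant or variable) or a triple ("+", left, right).


def substitute(term, subst):
    """Apply a substitution (dict from variable names to terms) to a term."""
    if isinstance(term, str):
        return subst.get(term, term)
    op, left, right = term
    return (op, substitute(left, subst), substitute(right, subst))


def leaves(term):
    """Multiset of leaves of a term built only from the AC symbol '+'."""
    if isinstance(term, str):
        return Counter([term])
    _, left, right = term
    return leaves(left) + leaves(right)


def ac_equal(s, t):
    """AC-equivalence for ground terms over '+' and constants only."""
    return leaves(s) == leaves(t)


def plus(left, right):
    return ("+", left, right)


# Clause 1: a + b = c, and clause 2: a + (b + x) = c + x.
s, t = plus("a", "b"), "c"
l, r = plus("a", plus("b", "x")), plus("c", "x")

# E-Parallel instantiates x by s on the left side and by t on the right side.
sigma, theta = {"x": s}, {"x": t}
clause4 = (substitute(l, sigma), substitute(r, theta))

# Clause 3: (a + a) + (b + b) != c + c.
clause3 = (plus(plus("a", "a"), plus("b", "b")), plus("c", "c"))

print(clause4)
print(ac_equal(clause4[0], clause3[0]), clause4[1] == clause3[1])
```

The left-hand side of the derived clause 4 is AC-equivalent to the left-hand side of clause 3 and the right-hand sides coincide, which is the contradiction mentioned in the text.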

Throughout this paper, we assume that (i) we are given an E-compatible 
reduction ordering > on terms with the subterm property that is E-total on 
ground terms, (ii) E has a finitary and complete unification algorithm, and (iii) 
E-congruence classes are finite. (If E satisfies condition (i), then E is necessarily 
regular [2].) Under these assumptions on E, we can deal uniformly with different
equational theories E in our framework and show that our inference system
including our contraction rules is refutationally complete.

The known practical theories satisfying the above assumptions of E are AC
and finite permutation theories [1,17]. (For example, if one considers an ACI
symbol $+$ using our approach, then AC should be the modulo E part and the
idempotency axiom (I: $x + x \approx x$) should be a part of the input formulas.)
Although associative (A)-unification is infinitary, our approach is also applicable

1 If $E = \emptyset$, then we may disregard the Generalized E-Parallel (or E-Parallel) rule along
with the E-Completion rule and replace E-unification with syntactic unification.
to the case where E = A in practice, since there is a tool for A-unification which
is guaranteed to terminate with a finite and complete set of A-unifiers for a
significantly large class of A-unification problems (see [14]).

The longer version of this paper is found in [16]. 


2 Preliminaries 


We assume that the reader has some familiarity with rewrite systems [3] (in- 
cluding the extended rewrite system for R modulo E (i.e. R, E) [11,15]) and 
unification [4]. We use the standard terminology of paramodulation [6,9, 22]. 

We denote by $T(F, X)$ the set of terms over a finite set of function
symbols $F$ and a denumerable set of variables $X$. An equation is an expression
$s \approx t$, where $s$ and $t$ are (first-order) terms built from $T(F, X)$. A literal is
either an equation $L$ (a positive literal) or a negative equation $\neg L$ (a negative
literal). A clause is a finite multiset of literals, written as a disjunction of literals
$\neg A_1 \vee \cdots \vee \neg A_m \vee B_1 \vee \cdots \vee B_n$ or as an implication $\Gamma \to \Delta$, where the
multiset $\Gamma$ is called the antecedent and the multiset $\Delta$ is called the succedent of the
clause. (Recall that a multiset is an unordered collection with possible duplicate
elements.)

An equational theory is a set of equations. (In this paper, an equational
theory and a set of axioms are used interchangeably.) We denote by $\approx_E$ the least
congruence on $T(F, X)$ that is closed under substitutions and contains a set of
equations E. If $s \approx_E t$ for two terms $s$ and $t$, then $s$ and $t$ are E-equivalent.

A (strict) ordering $\succ$ on terms is monotonic if $s \succ t$ implies $u[s]_p \succ u[t]_p$
for all $s$, $t$, $u$ and positions $p$. An ordering $\succ$ on terms is stable under
substitutions if $s \succ t$ implies $s\sigma \succ t\sigma$ for all $s$, $t$, and substitutions $\sigma$. An ordering $\succ$ on
terms is a rewrite ordering if it is monotonic and stable under substitutions. A
well-founded rewrite ordering is a reduction ordering. An ordering $\succ$ on terms
has the subterm property if $t[s]_p \succ s$ for all $s$, $t$, and $p \neq \lambda$. (In this paper, $\lambda$
denotes the top position.) A simplification ordering is a rewrite ordering with
the subterm property. An ordering $\succ$ on terms is E-compatible if $s \succ t$, $s \approx_E s'$,
and $t \approx_E t'$ implies $s' \succ t'$ for all $s$, $s'$, $t$ and $t'$. An ordering $\succ$ on ground terms
is E-total if $s \not\approx_E t$ implies $s \succ t$ or $t \succ s$ for all ground terms $s$ and $t$.

Given a multiset $S$ and an E-compatible ordering $\succ$ on $S$, we say that $x$ is
maximal (resp. strictly maximal) in $S$ if there is no $y \in S$ (resp. $y \in S \setminus \{x\}$)
with $y \succ x$ (resp. $y \succeq x$).

Clauses may also be considered as multisets of occurrences of equations. An
occurrence of an equation $s \approx t$ in the antecedent of a clause is the multiset
$\{\{s, t\}\}$, and in the succedent it is the multiset $\{\{s\}, \{t\}\}$. We denote
ambiguously all those orderings on terms, equations and clauses by $\succ$.

An equational theory is permutative if each equation in the theory contains
the same symbols on both sides with the same number of occurrences. The
depth of a term $t$ is defined as $\mathit{depth}(t) = 0$ if $t$ is a variable or a constant and
$\mathit{depth}(f(s_1, \ldots, s_n)) = 1 + \max\{\mathit{depth}(s_i) \mid 1 \le i \le n\}$. We say that an equational
theory has maximum depth at most $k$ if the maximum depth of all terms in the
equations in the theory is less than or equal to $k$.
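
The definitions of depth, maximum depth, and permutativity translate directly into code. The following Python sketch is a minimal illustration; the tuple encoding of terms and the function names are assumptions made here. It is applied to the AC axioms and to the one-equation theory $\{f(g(g(a))) \approx c\}$ that appears later in Example 2.

```python
from collections import Counter

# A term is a string (variable or constant) or a tuple (f, arg_1, ..., arg_n), n >= 1.


def depth(term) -> int:
    """depth(t) = 0 for variables and constants, else 1 + max depth of the arguments."""
    if isinstance(term, str):
        return 0
    return 1 + max(depth(arg) for arg in term[1:])


def symbol_counts(term) -> Counter:
    """Multiset of all symbols (function symbols, constants, variables) in a term."""
    if isinstance(term, str):
        return Counter([term])
    counts = Counter([term[0]])
    for arg in term[1:]:
        counts += symbol_counts(arg)
    return counts


def max_depth(theory) -> int:
    """Maximum depth over both sides of all equations of the theory."""
    return max(depth(side) for equation in theory for side in equation)


def is_permutative(theory) -> bool:
    """Both sides of every equation contain the same symbols equally often."""
    return all(symbol_counts(lhs) == symbol_counts(rhs) for lhs, rhs in theory)


AC = [
    (("+", "x", "y"), ("+", "y", "x")),                          # commutativity
    (("+", ("+", "x", "y"), "z"), ("+", "x", ("+", "y", "z"))),  # associativity
]
E_EXAMPLE_2 = [(("f", ("g", ("g", "a"))), "c")]

print(max_depth(AC), is_permutative(AC))
print(max_depth(E_EXAMPLE_2), is_permutative(E_EXAMPLE_2))
```

With this encoding, AC is permutative with maximum depth 2, while the theory of Example 2 has maximum depth 3 and is not permutative.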

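A (Herbrand) interpretation $I$ is a congruence on ground terms. $I$ satisfies
(is a model of) a ground clause $\Gamma \to \Delta$, denoted by $I \models \Gamma \to \Delta$, if $I \not\supseteq \Gamma$
or $I \cap \Delta \neq \emptyset$. In this case, we say that $\Gamma \to \Delta$ is true in $I$. A ground clause
$C$ follows from a set of ground clauses $\{C_1, \ldots, C_k\}$, denoted by $\{C_1, \ldots, C_k\} \models C$, if $C$ is true in every
model of $\{C_1, \ldots, C_k\}$.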


3 Constrained Clauses 


Definition 1 (Constrained clauses) [22,26] A constrained clause is a pair $C \,\|\, \phi$,
where $C$ is a clause and $\phi$ is an equality constraint consisting of a conjunction of
E-unification problems of the form $s \approx^?_E t$ for terms $s$ and $t$. The set of solutions of a constraint $\phi$, denoted
by $\mathit{Sol}(\phi)$, is the set of the ground substitutions defined inductively as:

$\mathit{Sol}(\phi_1 \wedge \phi_2) = \mathit{Sol}(\phi_1) \cap \mathit{Sol}(\phi_2)$,
$\mathit{Sol}(s \approx^?_E t) = \{\sigma \mid s\sigma \text{ and } t\sigma \text{ are E-equivalent}\}$.

A constraint $\phi$ is satisfiable if it admits at least one solution.


A constrained clause with an unsatisfiable constraint is a tautology. If every
ground substitution with domain $\mathit{Vars}(\phi)$ of $C \,\|\, \phi$ is a solution of $\phi$, then $\phi$ is
a tautological constraint. An unconstrained clause can also be considered as a
constrained clause with a tautological constraint.

The main technical difficulties in lifting a reduced ground inference to an
inference at the clause level in a basic paramodulation inference system involve
a ground clause of the form $C\sigma := D\sigma \vee x\sigma \approx t\sigma$ with $C := D \vee x \approx t \,\|\, \phi$
and $\sigma \in \mathit{Sol}(\phi)$, where $x\sigma \to t\sigma \in R$ for a given ground rewrite system $R$.
This motivates the following definition of irreducibility to lift a reduced ground
inference to an inference at the clause level in our inference system. (See [9] also
for order-irreducibility in the free case.)


Definition 2 (Order-irreducibility) Given a ground rewrite system $R$ and an
equational theory E, a ground literal $L[l']_p$ is order-reducible (at position $p$)
by $R, E$ with $l \to r \in R$ if $l' \approx_E l$, $l \succ r$ and $L \succ l \approx r$. A literal $L[s]$ is
order-irreducible in $s$ by $R, E$ if $L[s]$ is not order-reducible at any position of $s$.


In Definition 2, the condition $L \succ l \approx r$ is always true when $L$ is a negative
literal or else $l'$ does not occur at the top (i.e. $p = \lambda$) of the largest term of $L$.


Definition 3 (Reduced ground instances) Given a ground rewrite system $R$
and an equational theory E, $C\sigma$ is a ground instance of $C \,\|\, \phi$ if $\sigma$ is a solution
of $\phi$ (i.e. $\sigma \in \mathit{Sol}(\phi)$). It is a reduced ground instance of $C \,\|\, \phi$ w.r.t. $R, E$ if $\sigma$ is
a solution of $\phi$ and each ground literal $L[x\sigma]$ in $C\sigma$ is order-irreducible in $x\sigma$ by
$R, E$ for each variable $x \in \mathit{Vars}(C)$. In this case, $\sigma$ is a reduced solution of $C \,\|\, \phi$
w.r.t. $R, E$.


Definition 4 (A model of a constrained clause) An interpretation $I$ satisfies
(is a model of) a constrained clause $C \,\|\, \phi$, denoted by $I \models C \,\|\, \phi$, if it satisfies
every ground instance of $C \,\|\, \phi$ (i.e. every $C\sigma$ for which $\sigma$ is a solution of $\phi$).

Definition 5 (Reductiveness, weak reductiveness, semi-reductiveness, and weak
maximality) An equation $s \approx t$ is reductive (resp. weakly reductive) for $C \,\|\, \phi :=
D \vee s \approx t \,\|\, \phi$ if there exists a ground instance $C\sigma$ such that $s\sigma \approx t\sigma$ is strictly
maximal (resp. maximal) in $C\sigma$ with $s\sigma \succ t\sigma$. The clause $C \,\|\, \phi$ is simply called
reductive if there exists a reductive equation $s \approx t$ for $C \,\|\, \phi$. A negative equation
$u \not\approx v$ is semi-reductive (resp. weakly reductive) for $C \,\|\, \phi := D \vee u \not\approx v \,\|\, \phi$ if there
exists a ground instance $C\sigma$ such that $u\sigma \succ v\sigma$ (resp. $u\sigma \succ v\sigma$ and $u\sigma \not\approx v\sigma$ is
maximal in $C\sigma$). A literal $L$ is weakly maximal for $C \,\|\, \phi := D \vee L \,\|\, \phi$ if there
exists a ground instance $C\sigma$ such that $L\sigma$ is maximal in $C\sigma$.


4 Inference Rules 


The inference rules in our inference system are parameterized by a selection
function $S$ and an E-compatible reduction ordering $\succ$ with the subterm property
that is E-total on ground terms, where $S$ selects at most one (occurrence of a)
negative literal in the clause part $C$ of each (constrained) clause $C \,\|\, \phi$. For
technical convenience, if a literal $L$ is selected in $C$, then we also say that $L$ is
selected in $C \,\|\, \phi$. In our inference rules, a literal in a clause $C \,\|\, \phi$ is involved in
some inference if it is selected in $C$ (by $S$) or nothing is selected and it is maximal
in $C$ (cf. [8]). The following Basic Paramodulation rule is our main inference
rule for equational theorem proving modulo E, where only the maximal sides of
literals in clauses are involved in inferences by this rule. We rename variables
in the premises in our inference rules if necessary so that no variable is shared
between premises (i.e. standardized apart).


Basic Paramodulation 


$$\frac{C \vee s \approx t \,\|\, \phi_1 \qquad D \vee L[s'] \,\|\, \phi_2}{C \vee D \vee L[t] \,\|\, s \approx^?_E s' \wedge \phi_1 \wedge \phi_2}$$

if

1. $s'$ is not a variable,
2. $s \approx t$ is reductive for the left premise, and $C$ contains no selected literal,
3. one of the following three conditions is met:
   (a) $L$ is selected in the right premise, and
       $L$ is of the form $u[s'] \not\approx v$ and is semi-reductive for the right premise.
   (b) nothing is selected in the right premise, and
       $L$ is of the form $u[s'] \approx v$ and is reductive for the right premise.
   (c) nothing is selected in the right premise, and
       $L$ is of the form $u[s'] \not\approx v$ and is weakly reductive for the right premise.


Equality Resolution 


$$\frac{C \vee s \not\approx t \,\|\, \phi}{C \,\|\, s \approx^?_E t \wedge \phi}$$

if

$s \not\approx t$ is selected, or else nothing is selected and $s \not\approx t$ is weakly maximal for the
premise.


E-Factoring 


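$$\frac{C \vee s \approx t \vee s' \approx t' \,\|\, \phi}{C \vee t \not\approx t' \vee s' \approx t' \,\|\, s \approx^?_E s' \wedge \phi}$$

if

$s \approx t$ is weakly reductive for the premise, and $C$ contains no selected literal.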
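
E-Completion

$$\frac{C \vee s \approx t \,\|\, \phi}{C \vee e_1[t]_p \approx e_2 \,\|\, s \approx^?_E s' \wedge \phi}$$

if

1. $e_1[s']_p \approx e_2 \in E$ and $p \neq \lambda$, where $s'$ is not a variable,
2. $s \approx t$ is reductive for the premise, and $C$ contains no selected literal.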


The above E-Completion rule is an adaptation of the E-closure [27] rule
using equality constraints (cf. E-extension [5]).


E-Parallel 


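$$\frac{C \vee s \approx t \,\|\, \phi_1 \qquad D \vee l \approx r \,\|\, \phi_2}{C \vee D\sigma \vee l\sigma \approx r\theta \,\|\, \phi_1 \wedge \phi_2}$$

if

1. $s \approx t$ is reductive for the left premise, and $C$ contains no selected literal,
2. $l \approx r$ is reductive for the right premise, and $D$ contains no selected literal,
3. both $l$ and $s$ are not variables,
4. $\sigma = \{x \mapsto s\}$ and $\theta = \{x \mapsto t\}$ for some variable $x \in \mathit{Vars}(l) \cap \mathit{Vars}(r)$ with
   $x \notin \mathit{Vars}(\phi_2)$,
5. there is a term $u'$ with $u' \approx_E l\sigma$, such that $u'$ is $R, E$-reducible with
   $R = \{l \to r, s \to t\}$ only at the top position (i.e. no strict subterm of $u'$
   is $R, E$-reducible).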


Generalized E-Parallel 


CVsxt||di DVI Tr|| de 
CV Do Vlo x rô || ġ1 A 2 


if 


s X t is reductive for the left premise, and C contains no selected literal, 

l ~ r is reductive for the right premise, and D contains no selected literal, 
both l and s are not variables, 

eilu] © e2 € E, where u is not a variable, 

o = {x +> u[s]p} and 6 = {x +> ult],} for some variable x € Vars(l) N 
Vars(r) with z € Vars(¢2) and some position p, 


Shee ee 
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6. there is a term wu’ with u’ ~p lo, such that wu’ is R, E-reducible with 
R = {l = r,s => t} only at the top position. 


We mark each clause produced by the Generalized E-Parallel (or E-Parallel) 
rule as “protected” so that it is protected from our contraction rules discussed in 
Section 5. (We simply say each marked clause is a protected clause.) Protected 
clauses behave the same way as other clauses in our inference rules, but our 
contraction rules are not applied to protected clauses (see Section 5 for details). 

We may also use predicate terms [6] $P(t_1, \ldots, t_n)$ in our inference system,
where a predicate term cannot be a proper subterm of any term. Note that a
predicate term $P(t_1, \ldots, t_n)$ can be expressed as an equation $P(t_1, \ldots, t_n) \approx \top$,
where $\top$ is a special constant symbol minimal in the ordering $\succ$ and $P$ is
considered as a function symbol. (In this sense, $\neg P(t_1, \ldots, t_n)$ can be expressed
as $P(t_1, \ldots, t_n) \not\approx \top$.) In the remainder of this paper, by BP we denote the
inference system consisting of the Basic Paramodulation, Equality Resolution,
E-Factoring, E-Completion, and the Generalized E-Parallel rule. If E is a
permutative theory with maximum depth at most 2 (e.g. E = A, C, or AC), then
we use the simpler E-Parallel rule instead of the Generalized E-Parallel rule in
BP (see Lemma 6).


Example 1. Let $+$ be an AC symbol (in infix notation) with $+ \succ a \succ b \succ 0$ and
consider the following inconsistent set of clauses 1: $x + 0 \approx x$, 2: $a + a \approx 0$, 3:
$b + b \approx 0$, and 4: $(a + b) + (a + b) \not\approx 0$. Now we show how the empty clause (with
a satisfiable constraint) is derived:

5: $(x + y) + z \approx x + 0 \,\|\, y + z \approx_{AC} a + a$ (E-Completion with 2 using the
associativity axiom $x + (y + z) \approx (x + y) + z$.)

6: $((b + b) + y) + z \approx 0 + 0 \,\|\, y + z \approx_{AC} a + a$ (E-Parallel with 3 into 5. In
condition 5 of the E-Parallel rule, term $u'$ corresponds to $(b + y) + (b + z)$ here.)

7: $0 + 0 \not\approx 0 \,\|\, ((b + b) + y) + z \approx_{AC} (a + b) + (a + b) \land y + z \approx_{AC} a + a$ (Basic
Paramodulation with 6 into 4)

8: $x \not\approx 0 \,\|\, x + 0 \approx_{AC} 0 + 0 \land ((b + b) + y) + z \approx_{AC} (a + b) + (a + b) \land y + z \approx_{AC} a + a$
(Basic Paramodulation with 1 into 7)

9: $\Box \,\|\, x \approx_{AC} 0 \land x + 0 \approx_{AC} 0 + 0 \land ((b + b) + y) + z \approx_{AC} (a + b) + (a + b) \land y + z \approx_{AC} a + a$
(Equality Resolution on 8)

In contrast, the existing approaches for basic paramodulation modulo AC [21,
26] use clauses 2 and 4, for example, and produce clause 5': $0 + x \not\approx 0 \,\|\, x \approx_{AC} b + b$
and then clause 6': $0 + y \not\approx 0 \,\|\, x \approx_{AC} b + b \land y \approx_{AC} 0$ by their inference
rules. Then 6' is used to derive a contradiction with 1. Clause 6' can be viewed
as obtained from 5' by an indirect paramodulation with 3 in the constraint
part. In our approach, we simply block clauses like 5' from further inferences
(see Definition 12), and no direct or indirect paramodulation is allowed in the
constraint part of any clause.
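
As a small illustration of why the constraint of the empty clause in Example 1 is satisfiable, the following Python sketch checks the four AC-equations x ≈ 0, x + 0 ≈ 0 + 0, ((b + b) + y) + z ≈ (a + b) + (a + b), and y + z ≈ a + a under the candidate solution x = 0, y = a, z = a. It is only a sketch under stated assumptions: terms use a single AC symbol + over constants, so two ground terms are AC-equal exactly when their multisets of leaves coincide. The term encoding and the helper names (`substitute`, `flatten`, `ac_equal`, `plus`) are hypothetical and not part of the calculus.

```python
from collections import Counter

# Assumed encoding: a variable or constant is a str; s + t is the tuple ("+", s, t).

def plus(left, right):
    return ("+", left, right)

def substitute(term, subst):
    """Apply a substitution (dict: variable name -> term) to a term."""
    if isinstance(term, str):
        return subst.get(term, term)
    op, left, right = term
    return (op, substitute(left, subst), substitute(right, subst))

def flatten(term):
    """Multiset of leaves of a term built from the single AC symbol +."""
    if isinstance(term, str):
        return Counter([term])
    _, left, right = term
    return flatten(left) + flatten(right)

def ac_equal(s, t):
    """For ground terms over one AC symbol, AC-equality is equality of leaf multisets."""
    return flatten(s) == flatten(t)

# Candidate solution of the constraint of clause 9 (an assumption of this sketch).
sigma = {"x": "0", "y": "a", "z": "a"}

constraint = [
    ("x", "0"),
    (plus("x", "0"), plus("0", "0")),
    (plus(plus(plus("b", "b"), "y"), "z"), plus(plus("a", "b"), plus("a", "b"))),
    (plus("y", "z"), plus("a", "a")),
]

satisfied = all(
    ac_equal(substitute(lhs, sigma), substitute(rhs, sigma)) for lhs, rhs in constraint
)
print(satisfied)
```

The check succeeds for this substitution, which is enough to witness satisfiability; a general AC-unification procedure would be needed to find such solutions rather than verify them.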


Example 2. Consider $S = \{f(g(x)) \approx x, a \approx b, c \not\approx g(b)\}$ and $E = \{f(g(g(a))) \approx c\}$
with $f \succ g \succ a \succ b$, where E is a regular theory with maximum depth 3.
The Generalized E-Parallel rule with premises $f(g(x)) \approx x$ and $a \approx b$ produces
the conclusion $f(g(g(a))) \approx g(b)$. (Choose $l$ as $f(g(x))$, $s$ as $a$, and $u$ as $g(a)$ in
the Generalized E-Parallel rule.) Then it is used to derive a contradiction with
clause $c \not\approx g(b)$ since $f(g(g(a))) \approx_E c$.


In the above example, a suitable E-compatible reduction ordering $\succ$ on
ground terms is obtained in such a way that given two ground terms, we rewrite
each occurrence of $c$ in each ground term into $f(g(g(a)))$ at the same position
with (the occurrence of) $c$ and then use the standard lexicographic path ordering
[3,22] for comparing (rewritten) ground terms without any occurrence of $c$.
Then we may compare terms with variables by considering ground substitutions
and using this ordering on ground terms.

In what follows, by the Parallel rule we mean the E-Parallel or the Generalized
E-Parallel rule. First, observe that we cannot derive a contradiction in
either Example 1 or Example 2 using inference rules in BP without the Parallel rule.
The intuition behind the Parallel rule is that, above all, a reductive ground
clause corresponds to a reductive ground conditional rewrite rule [19] with
positive and negative conditions. Therefore, roughly speaking, the premises of the
Parallel rule are reductive conditional rewrite rules with positive and negative
conditions. (The Parallel rule applies only to reductive clauses.) Now the
conclusion of the Parallel rule combines two steps: (i) instantiating a “problematic”
variable in a special and restricted way, and (ii) selectively rewriting an
instantiated term if conditions are met. (Therefore, the conditions $C$ are included in the
conclusion.) A problematic variable is often determined by a built-in equational
theory E. It is mostly a variable produced by an E-Completion inference (see
Example 1) for AC cases, which is the counterpart of an extension variable for
AC-extension [7, 27].

Observe that the Generalized E-Parallel rule is more general than the E- 
Parallel rule. If p is always the top position for the Generalized E-Parallel rule, 
then they are equivalent. This is the case for permutative theories with maximum 
depth at most 2 (e.g. E = A,C, or AC). 


Lemma 6 If E is a permutative theory with maximum depth at most 2, then 
the E-Parallel rule and the Generalized E-Parallel rule are equivalent, i.e., they 
generate the same conclusion for the same input premises. 


Note that the E-Completion and the Parallel rule are not always needed
for every built-in equational theory E. The following example is a simple variant
of the reachability problem [15] modulo a permutation theory [1,17], where
$\neg P(f(c,b,b,d,e))$ is the query from the initial configuration $P(f(a,b,c,d,e))$.
We may view E in the following example as all permutations of variables
$x_1, x_2, x_3, x_4$, and $x_5$, since the symmetric group $S_5$ is generated by two cycles
(12) and (12345).


Example 3. Let $E = \{f(x_1,x_2,x_3,x_4,x_5) \approx f(x_2,x_1,x_3,x_4,x_5),\ f(x_1,x_2,x_3,x_4,x_5) \approx
f(x_2,x_3,x_4,x_5,x_1)\}$ with $P \succ f \succ a \succ b \succ c \succ d \succ e$ and
consider the following set of clauses 1: $\neg P(f(c,b,b,d,e))$, 2: $P(f(a,b,c,d,e))$,
and 3: $f(a,b,x,y,z) \approx f(b,b,x,y,z)$. Basic Paramodulation with 3 into 2
yields clause 4: $P(f(b,b,x,y,z)) \,\|\, f(a,b,x,y,z) \approx_E f(a,b,c,d,e)$. By applying
Basic Paramodulation with 1 and 4 (using $P(f(c,b,b,d,e)) \not\approx \top$ and
$P(f(b,b,x,y,z)) \approx \top \,\|\, f(a,b,x,y,z) \approx_E f(a,b,c,d,e)$) and then applying
Equality Resolution, we have clause 5: $\Box \,\|\, f(b,b,x,y,z) \approx_E f(c,b,b,d,e) \land
f(a,b,x,y,z) \approx_E f(a,b,c,d,e)$. The equality constraint in 5 is satisfiable and
we have a contradiction. Note that clause 4 schematizes the set of ground clauses
$\{P(f(b,b,c,d,e)), P(f(b,b,c,e,d)), P(f(b,b,d,c,e)), P(f(b,b,d,e,c)), P(f(b,b,e,c,d)),
P(f(b,b,e,d,c))\}$.
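
The claims of Example 3 are small enough to check by brute force. The Python sketch below (i) closes the two generators (12) and (12345) under composition to confirm that they generate all 120 permutations of five positions, (ii) enumerates the solutions of the constraint f(a,b,x,y,z) ≈_E f(a,b,c,d,e) over the constants a, ..., e, and (iii) checks that the constraint of clause 5 is satisfiable. Treating ≈_E on f-terms with constant arguments as equality of sorted argument lists relies on E generating all permutations, and restricting the search to the five constants is an assumption of this sketch.

```python
from itertools import product

def compose(p, q):
    """Composition of permutations given as tuples: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

swap = (1, 0, 2, 3, 4)    # the cycle (12), 0-based
cycle = (1, 2, 3, 4, 0)   # the cycle (12345), 0-based

# Close the generators under composition, starting from the identity.
group = {tuple(range(5))}
frontier = list(group)
while frontier:
    g = frontier.pop()
    for gen in (swap, cycle):
        h = compose(gen, g)
        if h not in group:
            group.add(h)
            frontier.append(h)

def e_equal(s_args, t_args):
    """f(s_args) ≈_E f(t_args) for constant arguments, E being all argument permutations."""
    return sorted(s_args) == sorted(t_args)

constants = "abcde"
solutions = [
    (x, y, z)
    for x, y, z in product(constants, repeat=3)
    if e_equal(("a", "b", x, y, z), ("a", "b", "c", "d", "e"))
]
ground_clauses = [f"P(f(b,b,{x},{y},{z}))" for x, y, z in solutions]

# Constraint of clause 5: both E-equations must hold for a common (x, y, z).
clause5_satisfiable = any(
    e_equal(("b", "b", x, y, z), ("c", "b", "b", "d", "e")) for x, y, z in solutions
)

print(len(group), ground_clauses, clause5_satisfiable)
```

The enumeration yields exactly the six ground instances listed for clause 4.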


5 Redundancy Criteria and Contraction Techniques 


Definition 7 (Relative reducibility) Given an equational theory E, a ground
instance $C\sigma_1$ of $C \,\|\, \phi_1$ is reduced relative to a ground instance $D\sigma_2$ of $D \,\|\, \phi_2$ if
for any rewrite system $R$, $C\sigma_1$ is a reduced ground instance of $C \,\|\, \phi_1$ w.r.t. $R, E$
whenever $D\sigma_2$ is a reduced ground instance of $D \,\|\, \phi_2$ w.r.t. $R, E$.


In what follows, the relation $\unlhd$ on terms represents the subterm relation, i.e.,
$s \unlhd t$ if $s$ is a subterm of $t$. The relation $\sqsubseteq$ on sets of terms is defined as follows:
$\{s_1,\ldots,s_m\} \sqsubseteq \{t_1,\ldots,t_n\}$ if for all $1 \le i \le m$, there is some $1 \le j \le n$ such
that $s_i \unlhd t_j$, and $\emptyset \sqsubseteq X$ for any set of terms $X$. Given a clause $C \,\|\, \phi$, we denote
by $Ran(\sigma|_{Vars(C)})$ for some $\sigma \in Sol(\phi)$ the range of the restriction of $\sigma$ to the set
of variables $Vars(C)$ if $Vars(C) \neq \emptyset$. If $C$ is a ground clause with a tautological
constraint (e.g. the empty constraint), then we set $Ran(\sigma|_{Vars(C)}) = \emptyset$. (Note
that any ground substitution is a solution of a tautological constraint.)

We say that a clause $C \,\|\, \phi$ is a clause with a succedent top variable [21]
w.r.t. $\sigma \in Sol(\phi)$ if there is a variable $x \in Vars(C) \cap Vars(\phi)$ only appearing in
equations $x \approx t$ of the succedent of $C$ with $x\sigma \succ t\sigma$ for some $t$. The following
lemma, which directly follows from Definition 7, is a sufficient syntactic condition
for $C\sigma_1$ being reduced relative to $D\sigma_2$ in Definition 7 if $D \,\|\, \phi_2$ is not a clause
with a succedent top variable w.r.t. $\sigma_2$. If $D \,\|\, \phi_2$ is a clause with a succedent top
variable $x$ w.r.t. some $\sigma_2 \in Sol(\phi_2)$, then one may (partially) instantiate $x$ in
$D$ with $\phi_2$ if possible, so that one may use the syntactic condition for checking
whether $C\sigma_1$ is reduced relative to $D\sigma_2$ as in the following lemma.


Lemma 8 Given an equational theory E, a ground instance $C\sigma_1$ of $C \,\|\, \phi_1$
is reduced relative to a ground instance $D\sigma_2$ of $D \,\|\, \phi_2$ if $Ran(\sigma_1|_{Vars(C)}) \sqsubseteq
Ran(\sigma_2|_{Vars(D)})$ and $D \,\|\, \phi_2$ is not a clause with a succedent top variable
w.r.t. $\sigma_2$.


In what follows, we denote by $E^{\prec C}$ (resp. $R^{\prec C}$) the set of ground instances
of equations in E (resp. the set of ground rewrite rules in $R$) smaller than the
ground clause $C$ (w.r.t. $\succ$), and by S modulo E a set of clauses S with a built-in
equational theory E.


Definition 9 (Redundancy) A clause $C \,\|\, \phi$ is redundant in S modulo E
(w.r.t. relative reducibility) if for every ground instance $C\sigma$, there exist ground
instances $C_1\sigma_1, \ldots, C_k\sigma_k$ of clauses $C_1 \,\|\, \phi_1, \ldots, C_k \,\|\, \phi_k$ in S reduced relative to
$C\sigma$, such that $C\sigma \succ C_i\sigma_i$, $1 \le i \le k$, and $\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup R^{\prec C\sigma} \cup E^{\prec C\sigma} \models C\sigma$
for any ground rewrite system $R$ contained in $\succ$. (In this case, we also say that
each $C\sigma$ is redundant in S modulo E (w.r.t. relative reducibility).)


Definition 10 (Basic E-simplification) An equation $l \approx r$ simplifies a clause
$C \lor L[l']_p \,\|\, \phi$ into $C \lor L[r\rho]_p \,\|\, \phi$ if the following conditions are met:

(i) $p$ is a non-variable position;

(ii) there is a substitution $\rho$ such that $l\rho \approx_E l'$, $L[l'] \succ l\rho \approx r\rho$, $Vars(l\rho) \supseteq
Vars(r\rho)$, $l\rho \succ r\rho$, and $C \lor L[l']_p \,\|\, \phi$ is neither protected nor a clause with
a succedent top variable w.r.t. any $\sigma \in Sol(\phi)$.


Lemma 11 If an equation $l \approx r$ simplifies a clause $C \lor L[l']_p \,\|\, \phi$ into $C \lor
L[r\rho]_p \,\|\, \phi$ as in Definition 10, then $C \lor L[l']_p \,\|\, \phi$ is redundant in S modulo E,
where $S = \{l \approx r,\ C \lor L[r\rho]_p \,\|\, \phi\}$.


The following definition extends the blocking rule in the free case (see [9]) 
to the modulo case, where a blocked clause does not contribute to finding a 
refutation during a theorem proving derivation w.r.t. BP (see Definition 16) 
starting with an initial set of unconstrained clauses. 


Definition 12 (Basic E-blocking) A clause $C \,\|\, \phi$ is blocked in S modulo E if
the following conditions are met:

(i) $C \,\|\, \phi$ is not a clause with a succedent top variable w.r.t. any $\tau \in Sol(\phi)$;

(ii) there is a variable $x \in Vars(C) \cap Vars(\phi)$ such that for every $\sigma \in Sol(\phi)$,
there exist ground instances $C_1\sigma_1, \ldots, C_k\sigma_k$ of clauses $C_1 \,\|\, \phi_1, \ldots, C_k \,\|\, \phi_k$
in S reduced relative to $C\sigma$, such that $C\sigma \succ C_i\sigma_i$, $1 \le i \le k$, and
$\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup E^{\prec C\sigma} \models x\sigma \approx s$ with $x\sigma \succ s$ for some ground term $s$.


Definition 13 (Basic E-instance) A clause $C \,\|\, \phi$ is a basic E-instance in S
modulo E if the following conditions are met:

(i) $C \,\|\, \phi$ is protected;

(ii) there is a protected clause $D \,\|\, \psi \in S$ such that for every ground instance $C\sigma$
(resp. $D\tau$) of $C \,\|\, \phi$ (resp. $D \,\|\, \psi$), there is a ground instance $D\tau$ (resp. $C\sigma$)
of $D \,\|\, \psi$ (resp. $C \,\|\, \phi$) such that they are reduced relative to each other with
$C\sigma = D\tau$.


Observe that protected clauses are produced in a restricted way (e.g. see
condition 5 in the E-Parallel rule) and if two protected clauses are the same up
to variable renaming, then they are basic E-instances of each other and they do
not need to be distinguished.


Definition 14 (Redundancy of an inference) An inference $\pi$ with conclusion
$D \,\|\, \phi$ is redundant in S modulo E (w.r.t. relative reducibility) if $D \,\|\, \phi$ is blocked
or a basic E-instance in S modulo E, or for every ground instance $\pi\sigma$ with
maximal premise $C$ and conclusion $D\sigma$, there exist ground instances $C_1\sigma_1, \ldots, C_k\sigma_k$
of clauses $C_1 \,\|\, \phi_1, \ldots, C_k \,\|\, \phi_k$ in S reduced relative to $D\sigma$, such that $C \succ C_i\sigma_i$,
$1 \le i \le k$, and $\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup R^{\prec C} \cup E^{\prec C} \models D\sigma$ for any ground rewrite
system $R$ contained in $\succ$.


The following lemma immediately follows from Definition 9 and the observation
that if $\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup E^{\prec C\sigma} \models C\sigma$, then $\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup R^{\prec C\sigma} \cup
E^{\prec C\sigma} \models C\sigma$ for any ground rewrite system $R$ contained in $\succ$, which serves as a
sufficient condition for redundancy of clauses. Also, if an (unconstrained) clause
$C$ properly subsumes an (unconstrained) clause $C' \lor D$ in the classical sense,
where $C$ and $C'$ are the same up to variable renaming, then it is easy to see that
$C' \lor D$ is redundant in $\{C\}$ modulo E.


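Lemma 15 A clause $C \,\|\, \phi$ is redundant in S modulo E if for every ground
instance $C\sigma$, there exist ground instances $C_1\sigma_1, \ldots, C_k\sigma_k$ of clauses $C_1 \,\|\, \phi_1, \ldots,
C_k \,\|\, \phi_k$ in S reduced relative to $C\sigma$, such that $C\sigma \succ C_i\sigma_i$, $1 \le i \le k$, and
$\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup E^{\prec C\sigma} \models C\sigma$.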


Definition 16 (Theorem proving derivation) A theorem proving derivation is a
sequence of sets of clauses $S_0 = S, S_1, \ldots$ such that:

(i) Deduction: $S_i = S_{i-1} \cup \{C \,\|\, \phi\}$ for some $C \,\|\, \phi$ if it can be deduced from
premises in $S_{i-1}$ by applying an inference rule in BP or basic E-simplification.
(ii) Deletion: $S_i = S_{i-1} \setminus \{D \,\|\, \psi\}$ for some $D \,\|\, \psi$ if it is not protected, and is
redundant or blocked in $S_{i-1}$ modulo E.


The set $S_\infty$ of persistent clauses is defined as $\bigcup_i (\bigcap_{j \ge i} S_j)$, which is called
the limit of the derivation. A theorem proving derivation $S_0, S_1, S_2, \ldots$ is fair [6]
w.r.t. the inference system BP if every inference $\pi$ by BP with premises in $S_\infty$
is redundant in $\bigcup_j S_j$ modulo E.


Definition 17 (Saturation w.r.t. relative reducibility) Given an equational the- 
ory E, we say that S modulo E is saturated under BP w.r.t. relative reducibility 
if every inference by BP with premises in S is redundant in S modulo E. 


In what follows, we say that a clause $C \,\|\, \phi$ is non-protected redundant (resp.
non-protected blocked) in S modulo E if it is not protected and is redundant
(resp. blocked) in S modulo E. (If $C \,\|\, \phi$ is non-protected redundant in S modulo
E, then we also say that each ground instance $C\sigma$ of $C \,\|\, \phi$ is non-protected
redundant in S modulo E.)


Lemma 18 (i) If $S \subseteq S'$, then any clause which is non-protected redundant or
non-protected blocked in S modulo E is also non-protected redundant or
non-protected blocked in S' modulo E.

(ii) Let $S \subseteq S'$ such that all clauses in $S' \setminus S$ are non-protected redundant
or non-protected blocked in S' modulo E. Then (ii.1) any clause which is
non-protected redundant or non-protected blocked in S' modulo E is also non-protected
redundant or non-protected blocked in S modulo E, and (ii.2) any inference which
is redundant in S' modulo E is also redundant in S modulo E.


Lemma 19 Let $S_0, S_1, \ldots$ be a fair theorem proving derivation w.r.t. BP such
that $S_0$ is a set of unconstrained clauses. Then $S_\infty$ modulo E is saturated under
BP w.r.t. relative reducibility.

Proof. If $S_\infty$ contains the empty clause, then it is immediate that $S_\infty$ modulo E
is saturated under BP w.r.t. relative reducibility, so we assume that the empty
clause is not in $S_\infty$.

If a clause $C \,\|\, \phi$ is deleted in a theorem proving derivation, then we see that
it is non-protected redundant or non-protected blocked in some $S_j$ modulo E. It
is also non-protected redundant or non-protected blocked in $\bigcup_j S_j$ modulo E by
Lemma 18(i). Similarly, every clause in $\bigcup_j S_j \setminus S_\infty$ is non-protected redundant
or non-protected blocked in $\bigcup_j S_j$ modulo E.

Now by fairness of the derivation, every inference $\pi$ by BP with premises in
$S_\infty$ is redundant in $\bigcup_j S_j$ modulo E. Then by Lemma 18(ii.2) and the above, $\pi$
is also redundant in $S_\infty$ modulo E. Thus, $S_\infty$ modulo E is saturated under BP
w.r.t. relative reducibility.


6 Refutational Completeness 


The soundness of BP (w.r.t. a fair theorem proving derivation) is straightforward,
i.e., $S_i \cup E \models S_{i+1} \cup E$ for all $i \ge 0$. If the empty clause is in some $S_j$,
then $S_0 \cup E$ is unsatisfiable by the soundness of BP. The following theorem
states that BP with our contraction rules (i.e. basic E-simplification and basic
E-blocking) is refutationally complete. In order to prove the following theorem,
we adapt a variant of model construction techniques [7-9,21,27]. In this section,
we assume that the equality is the only predicate by expressing other predicates
(i.e. predicate terms) as (predicate) equations as discussed in Section 4.


Theorem 20 Let $S_0, S_1, \ldots$ be a fair theorem proving derivation w.r.t. BP such
that $S_0$ is a set of unconstrained clauses. Then $S_0 \cup E$ is unsatisfiable if and only
if the empty clause is in some $S_j$.


Definition 21 (Model construction) Let S be a set of (constrained) clauses. We
use induction on $\succ$ to define the sets $Rules_C$, $R_C$, $E_C$, and $I_C$, for all ground
instances $C$ of clauses in S. Let $C$ be such a ground instance of a clause in S and
suppose that $Rules_{C'}$ has been defined for all ground instances $C'$ of clauses in
S for which $C \succ C'$. Then we define by $R_C = \bigcup_{C \succ C'} Rules_{C'}$ and by $E_C$ the
set of ground instances $e_1 \approx e_2$ of equations in E, such that $C \succ e_1 \approx e_2$, and
$e_1$ and $e_2$ are both irreducible by $R_C$. We also define by $I_C$ the interpretation
$(R_C \cup E_C)^*$ (i.e. the least congruence containing $R_C \cup E_C$).

Now let $C := D \lor s \approx t$ be a reduced ground instance of a clause in S w.r.t. $R_C$
such that $C$ is not an instance of a clause with a selected literal. Then $C$ produces
the set of ground rewrite rules $Rules_C = \{u \to t \mid u \approx_E s$ and $u$ is irreducible by
$R_C\}$ if the following conditions are met: (1) $I_C \not\models C$ (resp. $I_C \not\models D$) if $C$ is an
instance of a non-protected clause (resp. protected clause), (2) $I_C \not\models t \approx t'$ for
every $s' \approx t'$ in $D$ with $s' \approx_E s$, (3) $s \approx t$ is reductive for $C$, and (4) there exists
$u$ with $u \approx_E s$ for which $u$ is irreducible by $R_C$. We say that $C$ is productive and
produces $Rules_C$ if it satisfies all of the above conditions. Otherwise, $Rules_C = \emptyset$.
Finally, we define $R_S = \bigcup_C R_C$, $E_S = \bigcup_C E_C$, and $I_S = (R_S \cup E_S)^*$.

We may include the special non-productive ground clause $tt \approx tt$ in S for
the above (inductive) definition, where $tt \approx tt$ is assumed to be greater than all
ground instances of clauses in $S \cup E$ w.r.t. $\succ$ other than $tt \approx tt$ itself (see [21,27]).
(If $C$ is the strictly maximal ground instance among ground instances of clauses
in S and is productive, then $R_S$ may not include $Rules_C$ by the above inductive
definition of $R_C$ without $tt \approx tt$.) In what follows, we say that a ground instance
$\pi\sigma$ of an inference $\pi$ with premises in S is reduced if each premise and conclusion
of $\pi\sigma$ is a reduced ground instance of a clause in $S \cup E$ w.r.t. $R_S, E_S$.


Definition 22 (Redundancy w.r.t. $R_S, E_S$) A clause $C \,\|\, \phi$ is redundant in S
modulo E w.r.t. $R_S, E_S$ if for every reduced ground instance $C\sigma$ w.r.t. $R_S, E_S$,
there exist reduced ground instances $C_1\sigma_1, \ldots, C_k\sigma_k$ of clauses $C_1 \,\|\, \phi_1, \ldots, C_k \,\|\, \phi_k$
in S w.r.t. $R_S, E_S$, such that $C\sigma \succ C_i\sigma_i$, $1 \le i \le k$, and $\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup
R_S^{\prec C\sigma} \cup E^{\prec C\sigma} \models C\sigma$. (In this case, we also say that each $C\sigma$ is redundant in S
modulo E w.r.t. $R_S, E_S$.)

An inference $\pi$ with conclusion $D \,\|\, \phi$ is redundant in S modulo E w.r.t. $R_S, E_S$
if $D \,\|\, \phi$ is blocked or a basic E-instance in S modulo E, or for every reduced
ground instance $\pi\sigma$ with maximal premise $C$ and conclusion $D\sigma$, there exist
reduced ground instances $C_1\sigma_1, \ldots, C_k\sigma_k$ of clauses $C_1 \,\|\, \phi_1, \ldots, C_k \,\|\, \phi_k$ in S
w.r.t. $R_S, E_S$, such that $C \succ C_i\sigma_i$, $1 \le i \le k$, and $\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup R_S^{\prec C} \cup
E^{\prec C} \models D\sigma$.


Definition 23 (Saturation w.r.t. $R_S, E_S$) Given an equational theory E, we
say that S modulo E is saturated under BP w.r.t. $R_S, E_S$ if every inference by
BP with premises in S is redundant in S modulo E w.r.t. $R_S, E_S$.


Lemma 24 (i) There are no overlaps among the left-hand sides of rules in $R_S$.

(ii) A term $t$ is reducible by $R_S$ if and only if it is reducible by $R_S, E_S$ at the
same position.

(iii) For every $l \to r, s \to t \in R_S$, if $l \approx_E s$, then $r$ and $t$ are the same term.

(iv) $R_S/E_S$ is terminating.

(v) For ground terms $u$ and $v$, if $I_S \models u \approx v$, then $u \downarrow_{R_S,E_S} v$.

(vi) If a ground instance $C\theta := D\theta \lor l\theta \approx r\theta$ of a clause $C \,\|\, \phi := D \lor l \approx r \,\|\, \phi$
is productive, then it is a reduced ground instance of $C \,\|\, \phi$ w.r.t. $R_S, E_S$.


The proofs of (i), (ii), and (iii) in Lemma 24 follow from the construction of
$R_S$ in Definition 21. For (iv), since $R_S$ is contained in an E-compatible reduction
ordering $\succ$ on terms that is E-total on ground terms, $R_S/E_S$ is terminating.
Meanwhile, Lemma 24(v) describes the ground Church-Rosser property [19] of
$R_S, E_S$. Since $R_S/E_S$ is terminating by (iv), this shows that $R_S, E_S$ is ground
convergent modulo $E_S$. In the following, we assume that any saturated clause
set under BP is obtained from an initial set of clauses without constraints.


Lemma 25 Let S modulo E be saturated under BP w.r.t. $R_S, E_S$ not containing
the empty clause and let $C$ be a reduced ground instance of a clause in S
w.r.t. $R_S, E_S$ or a ground instance of an equation in E. Then $C$ is true in $I_S$.
More specifically,

(i) $C$ is not an instance of a blocked clause in S modulo E.

(ii) If $C$ is redundant in S modulo E w.r.t. $R_S, E_S$, then it is true in $I_S$.

(iii) If $C$ is an instance of a clause with a selected literal, then it is true in $I_S$.

(iv) If $C$ contains a maximal negative literal (w.r.t. $\succ$) and is not an instance
of a clause with a selected literal, then it is true in $I_S$.

(v) If $C$ is an instance of an equation in E, then it is true in $I_S$.

(vi) If $C$ is an instance of a protected clause or a basic E-instance of it, then it
is true in $I_S$.

(vii) If $C$ is non-productive, then it is true in $I_S$.

(viii) If $C := C' \lor s \approx t$ is productive and produces $Rules_C$ with $s \to t \in Rules_C$,
then $C'$ is false and $C$ is true in $I_S$.


We leave it to the reader to verify the following lemma using the definitions
of redundancy of an inference w.r.t. relative reducibility and w.r.t. $R_S, E_S$, along
with Lemma 19.


Lemma 26 Let $S_0, S_1, \ldots$ be a fair theorem proving derivation w.r.t. BP such
that $S_0$ is a set of unconstrained clauses. Then $S_\infty$ modulo E is saturated under
BP w.r.t. $R_{S_\infty}, E_{S_\infty}$.


Theorem 27 Let $S_0, S_1, \ldots$ be a fair theorem proving derivation w.r.t. BP such
that $S_0$ is a set of unconstrained clauses. If $S_\infty$ does not contain the empty clause,
then $I_{S_\infty} \models S_0 \cup E$ (i.e., $S_0 \cup E$ is satisfiable).


Proof. By Lemma 26, we know that $S_\infty$ modulo E is saturated under BP
w.r.t. $R_{S_\infty}, E_{S_\infty}$. Let $C$ be a ground instance of an equation in E or a ground
instance of a clause $C'$ in $S_0$. By Lemma 25(v), if $C$ is a ground instance
of an equation in E, then it is true in $I_{S_\infty}$. Therefore, we assume that $C$ is
not a ground instance of an equation in E. Suppose first that $C := C'\sigma'$ is
a reduced ground instance of $C' \in S_0$ w.r.t. $R_{S_\infty}, E_{S_\infty}$. Then there are two
cases to consider. If $C' \in S_\infty$, then $C$ is true in $I_{S_\infty}$ by Lemma 25. Otherwise,
if $C' \notin S_\infty$, then $C'$ is (non-protected) redundant in some $S_j$ modulo E
w.r.t. relative reducibility because $C' \in S_0$ (with the empty constraint) is neither
protected nor can it be a blocked clause in some $S_j$ modulo E. Thus, $C'$ is
(non-protected) redundant in $\bigcup_j S_j$ modulo E w.r.t. relative reducibility, and hence
is (non-protected) redundant in $S_\infty$ modulo E w.r.t. relative reducibility by
Lemma 18. It follows that there exist ground instances $C_1\sigma_1, \ldots, C_k\sigma_k$ of clauses
$C_1 \,\|\, \phi_1, \ldots, C_k \,\|\, \phi_k$ in $S_\infty$ reduced relative to $C$, such that $C \succ C_i\sigma_i$, $1 \le i \le k$,
and $\{C_1\sigma_1, \ldots, C_k\sigma_k\} \cup R^{\prec C} \cup E^{\prec C} \models C$ for any ground rewrite system $R$
contained in $\succ$. Since $C$ is a reduced ground instance of $C'$ w.r.t. $R_{S_\infty}, E_{S_\infty}$, we
see that $C_i\sigma_i$, $1 \le i \le k$, are also reduced ground instances w.r.t. $R_{S_\infty}, E_{S_\infty}$ by
Definition 7 and are true in $I_{S_\infty}$ by Lemma 25. Similarly, $R_{S_\infty}^{\prec C}$ and $E^{\prec C}$ are
true in $I_{S_\infty}$ by Lemma 25, and hence we may infer that $C$ is also true in $I_{S_\infty}$.

Now suppose that $C := C'\sigma'$ is a reducible ground instance of $C' \in S_0$
w.r.t. $R_{S_\infty}, E_{S_\infty}$. Let $\sigma''$ be a ground substitution such that $x\sigma'' = x\sigma'{\downarrow}_{R_{S_\infty},E_{S_\infty}}$
for each $x \in Vars(C')$. Since $C'\sigma''$ is a reduced ground instance of $C' \in S_0$
w.r.t. $R_{S_\infty}, E_{S_\infty}$, $C'\sigma''$ is true in $I_{S_\infty}$ by the previous paragraph, and hence $C$
is also true in $I_{S_\infty}$.




We may now present the proof that BP with our contraction rules is refuta- 
tionally complete. 


Proof of Theorem 20 Let $S_0, S_1, \ldots$ be a fair theorem proving derivation
w.r.t. BP such that $S_0$ is a set of unconstrained clauses. If the empty clause is in
some $S_j$, then $S_0 \cup E$ is unsatisfiable by the soundness of BP. Otherwise, if the
empty clause is not in $S_k$ for all $k$, then by the soundness of BP, $S_\infty$ does not
contain the empty clause, and hence $S_0 \cup E$ is satisfiable by Theorem 27.


7 Conclusion 


We have presented a basic paramodulation calculus modulo and provided a
framework for equational theorem proving modulo equational theories E
satisfying some properties of E using constrained clauses, where a constrained clause
may schematize a set of unconstrained clauses by keeping E-unification problems
in its constraint part. Our results imply that we can deal uniformly with different
equational theories E in our equational theorem proving modulo framework. We
only need a single refutational completeness proof for our basic paramodulation
calculus modulo E for different equational theories E.

Our contraction techniques (i.e. basic E-simplification and basic E-blocking)
for constrained clauses can also be applied uniformly for different equational
theories E satisfying some properties of E in our equational theorem proving
modulo framework. Since a constrained clause may schematize a set of
unconstrained clauses, the simplification or deletion of a constrained clause may
correspond to the simplification or deletion of a set of unconstrained clauses. We
have proposed a saturation procedure for constrained clauses based on relative
reducibility and showed the refutational completeness of our inference system
using a saturated clause set (w.r.t. $\succ$).

Some possible improvements remain to be done. One of the main issues is
broadening the scope of our equational theorem proving modulo E to more
equational theories E. This can be achieved by dropping or weakening some
ordering requirements of $\succ$ (e.g. monotonicity of $\succ$) for a basic paramodulation
calculus modulo E, while maintaining the refutational completeness of the
calculus (cf. [10]). This can also be achieved by finding suitable E-compatible
orderings for more equational theories E. In fact, we provided an E-compatible
simplification ordering $\succ$ on terms that is E-total on ground terms for finite
permutation theories E in [17], which allows us to provide a refutationally complete
equational theorem proving with built-in permutation theories using the results
of this paper. Since permutations play an important role in mathematics and
many fields of science including computer science, we believe that developing
applications for equational theorem proving with built-in permutation theories
is another promising future research direction.


Abstract. The entailment problem $\phi \models \psi$ in Separation Logic [12,15], between
separated conjunctions of equational ($x \approx y$ and $x \not\approx y$), spatial ($x \mapsto (y_1,\ldots,y_\kappa)$)
and predicate ($p(x_1,\ldots,x_n)$) atoms, interpreted by a finite set of inductive rules,
is undecidable in general. Certain restrictions on the set of inductive definitions
lead to decidable classes of entailment problems. Currently, there are two such
decidable classes, based on two restrictions, called establishment [10,13,14] and
restrictedness [8], respectively. Both classes are shown to be in 2EXPTIME by
the independent proofs from [14] and [8], respectively, and a many-one reduction
of established to restricted entailment problems has been given [8]. In this paper,
we strictly generalize the restricted class, by distinguishing the conditions that
apply only to the left- ($\phi$) and the right- ($\psi$) hand side of entailments, respectively.
We provide a many-one reduction of this generalized class, called safe, to the
established class. Together with the reduction of established to restricted entailment
problems, this new reduction closes the loop and shows that the three classes of
entailment problems (respectively established, restricted and safe) form a single,
unified, 2EXPTIME-complete class.


1 Introduction 


Separation Logic [12,15] (SL) was primarily introduced for writing concise Hoare logic 
proofs of programs that handle pointer-linked recursive data structures (lists, trees, etc). 
Over time, SL has evolved into a powerful logical framework, that constitutes the basis 
of several industrial-scale static program analyzers [3,2,5], that perform scalable com- 
positional analyses, based on the principle of local reasoning: describing the behavior 
of a program statement with respect only to the small (local) set of memory locations 
that are changed by that statement, with no concern for the rest of the program’s state. 

Given a set of memory locations (e.g., addresses), SL formulæ describe heaps, that
are finite partial functions mapping finitely many locations to records of locations. A
location $\ell$ is allocated if it occurs in the domain of the heap. An atom $x \mapsto (y_1,\ldots,y_\kappa)$
states that there is only one allocated location, associated with $x$, that moreover refers
to the tuple of locations associated with $(y_1,\ldots,y_\kappa)$, respectively. The separating
conjunction $\phi * \psi$ states that the heap can split into two parts, with disjoint domains, that
make $\phi$ and $\psi$ true, respectively. The separating conjunction is instrumental in
supporting local reasoning, because the disjointness between the (domains of the) models of its
arguments ensures that no update of one heap can actually affect the other.
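
To make the informal semantics concrete, the following Python sketch models a heap as a dictionary from locations to tuples of locations, a store as a dictionary from variables to locations, and checks a points-to atom and a separating conjunction by enumerating all splits of the heap into two disjoint parts. This is an illustrative brute-force model checker for tiny heaps under these assumed encodings; the function names `points_to`, `splits`, and `sep_conj` are hypothetical and the sketch is not the decision procedure discussed in the paper.

```python
def points_to(heap, store, x, ys):
    """x |-> (y1, ..., yk): the heap has exactly one allocated location, store[x],
    and it holds the record of locations (store[y1], ..., store[yk])."""
    return heap == {store[x]: tuple(store[y] for y in ys)}

def splits(heap):
    """Enumerate all ways to split a heap into two sub-heaps with disjoint domains."""
    locations = list(heap)
    for mask in range(2 ** len(locations)):
        left = {loc: heap[loc] for i, loc in enumerate(locations) if (mask >> i) & 1}
        right = {loc: heap[loc] for loc in locations if loc not in left}
        yield left, right

def sep_conj(heap, phi, psi):
    """phi * psi holds if some split makes phi true on one part and psi on the other."""
    return any(phi(h1) and psi(h2) for h1, h2 in splits(heap))

store = {"x": 1, "y": 2, "z": 3}
heap = {1: (2,), 2: (3,)}

x_to_y = lambda h: points_to(h, store, "x", ["y"])
y_to_z = lambda h: points_to(h, store, "y", ["z"])

print(sep_conj(heap, x_to_y, y_to_z))            # both cells, in disjoint parts
print(sep_conj({1: (2,)}, x_to_y, x_to_y))       # one cell cannot be used twice
```

The second query fails because the single cell cannot belong to both parts of a split, which is the disjointness that local reasoning relies on.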

Reasoning about recursive data structures of unbounded sizes (lists, trees, etc.) is
possible via the use of predicate symbols, whose interpretation is specified by a
user-provided set of inductive definitions (SID) of the form $p(x_1,\ldots,x_n) \Leftarrow \pi$, where $p$ is
a predicate symbol of arity $n$ and the free variables of the formula $\pi$ are among the
parameters $x_1,\ldots,x_n$ of the rule. Here the separating conjunction ensures that each
unfolding of the rules, which substitute some predicate atom $p(y_1,\ldots,y_n)$ by a formula
$\pi[x_1/y_1,\ldots,x_n/y_n]$, corresponds to a way of building the recursive data structure. For
instance, a list is either empty, in which case its head equals its tail pointer, or is built
by first allocating the head, followed by all elements up to but not including the tail, as
stated by the inductive definitions $ls(x,y) \Leftarrow x \approx y$ and $ls(x,y) \Leftarrow \exists z.\ x \mapsto (z) * ls(z,y)$.
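
The two rules for the list-segment predicate can be read as a recursive check on a concrete heap. The Python sketch below decides whether a heap, given as a dictionary from locations to one-field records, is a model of ls(x, y) by unfolding the rules. It assumes, as is usual for symbolic heaps but not stated in this passage, that an equational atom holds only in the empty heap; the function name `is_ls` and the integer encoding of locations are hypothetical.

```python
def is_ls(heap, x, y):
    """Check whether `heap` is a model of ls(x, y) by unfolding the two rules.

    heap: dict mapping a location to a 1-tuple (its successor).
    x, y: locations (the values of the two parameters).
    """
    # Rule ls(x, y) <= x ≈ y: assumed to hold only when nothing is allocated.
    if x == y and not heap:
        return True
    # Rule ls(x, y) <= ∃z. x |-> (z) * ls(z, y): x must be allocated with one field,
    # and the rest of the heap (disjoint from the cell at x) must model ls(z, y).
    if x not in heap or len(heap[x]) != 1:
        return False
    (z,) = heap[x]
    rest = {loc: rec for loc, rec in heap.items() if loc != x}
    return is_ls(rest, z, y)

print(is_ls({1: (2,), 2: (3,)}, 1, 3))   # 1 -> 2 -> 3, segment from 1 to 3
print(is_ls({1: (2,), 5: (6,)}, 1, 2))   # extra allocated cell 5 is not part of the list
```

Each recursive call removes the cell at the current head, so the check terminates after at most as many steps as there are allocated locations.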


An important problem in program verification, arising during the construction of
Hoare-style correctness proofs of programs, is the discharge of verification conditions
of the form $\phi \models \psi$, where $\phi$ and $\psi$ are SL formulæ, asking whether every model of $\phi$ is
also a model of $\psi$. These problems, called entailments, are, in general, undecidable in
the presence of inductively defined predicates [11,1].


A first decidable class of entailments, described in [10], involves three restrictions 
on the SID rules: progress, connectivity and establishment. Intuitively, the progress (P) 
condition states that every rule allocates exactly one location, the connectivity (C) con- 
dition states that the set of allocated locations has a tree-shaped structure, and the es- 
tablishment (E) condition states that every existentially quantified variable from a rule 
defining a predicate is (eventually) allocated in every unfolding of that predicate. A 
2EXPTIME algorithm was proposed for testing the validity of PCE entailments [13,14] 
and a matching 2EXPTIME-hardness lower bound was provided shortly after [6]. 


Later work relaxes the establishment condition, necessary for decidability [7], by 
proving that the entailment problem is still in 2EXPTIME if the establishment condition 
is replaced by the restrictedness (R) condition, which requires that every disequality 
($x \not\approx y$) involves at least one free variable from the left-hand side of the entailment,
propagated through the unfoldings of the inductive system [8]. Interestingly, the rules of 
a progressive, connected and restricted (PCR) entailment may generate data structures 
with “dangling” (i.e. existentially quantified but not allocated) pointers, which was not 
possible with PCE entailments. 


In this paper, we generalize PCR entailments further, by showing that the connec- 
tivity and restrictedness conditions are needed only on the right-hand side of the en- 
tailment, whereas the only condition required on the left-hand side is progress (which 
can usually be enforced by folding or unfolding definitions). Our results thus allow for
“asymmetric” entailments, i.e., one can test whether the structures described by
inductive rules that are (almost) arbitrary fulfill some restricted formula. Although the class
of data structures that can be described is much larger, we show that this new class of 
entailments, called safe, is also 2EXPTIME-complete, by a many-one reduction of the 
validity of safe entailments to the validity of PCE entailments. A second contribution 
of the paper is the cross-certification of the two independent proofs of the 2EXPTIME 
upper bounds, for the PCE [6,14,8] and PCR [8] classes of entailments, respectively, 
by closing the loop. Namely, the reduction given in this paper enables the translation 
of any of the three entailment problems into an equivalent problem in any other class, 
while preserving the 2EXPTIME upper bound. This is because all the reductions are 
polynomial in the overall size of the SID and singly-exponential in the maximum size
of the rules in the SID. The theoretical interest of the reduction is that it makes the proof 
of decidability and of the complexity class much shorter and clearer. It also has some 
practical advantages, since it allows one to re-use existing implementations designed 
for established systems instead of having to develop entirely new automated reasoning 
systems. Due to space restrictions, some of the proofs are omitted. All proofs can be 
found in [9]. 


2 Definitions 


For a (partial) function $f : A \to B$, we denote by $\mathrm{dom}(f)$ and $\mathrm{rng}(f)$ its domain and
range, respectively. For a relation $R \subseteq A \times A$, we denote by $R^*$ the reflexive and transitive
closure of $R$.

Let $\kappa$ be a fixed natural number throughout this paper and let $\mathbb{P}$ be a countably
infinite set of predicate symbols. Each predicate symbol $p \in \mathbb{P}$ is associated with a unique
arity, denoted $\mathrm{ar}(p)$. Let $\mathbb{V}$ be a countably infinite set of variables. For technical convenience,
we also consider a special constant $\bot$, which will be used to denote “empty”
record fields. Formulæ are built inductively, according to the following syntax:


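$$\phi := x \approx x' \mid x \not\approx x' \mid x \mapsto (y_1,\ldots,y_\kappa) \mid p(x_1,\ldots,x_n) \mid \phi_1 * \phi_2 \mid \phi_1 \vee \phi_2 \mid \exists x \,.\, \phi_1$$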


where $p \in \mathbb{P}$ is a predicate symbol of arity $n = \mathrm{ar}(p)$, $x, x', x_1,\ldots,x_n \in \mathbb{V}$ are variables
and $y_1,\ldots,y_\kappa \in \mathbb{V} \cup \{\bot\}$ are terms, i.e. either variables or $\bot$.

The set of variables freely occurring in a formula $\phi$ is denoted by $\mathrm{fv}(\phi)$. We assume,
by $\alpha$-equivalence, that the same variable cannot occur both free and bound in the same
formula $\phi$, and that distinct quantifiers bind distinct variables. The size $|\phi|$ of a formula $\phi$
is the number of occurrences of symbols in $\phi$. A formula $x \approx x'$ or $x \not\approx x'$ is an equational
atom, $x \mapsto (y_1,\ldots,y_\kappa)$ is a points-to atom, whereas $p(x_1,\ldots,x_n)$ is a predicate
atom. Note that $\bot$ cannot occur in an equational or in a predicate atom. A formula is
predicate-less if no predicate atom occurs in it. A symbolic heap is a formula of the form
$\exists \mathbf{x} \,.\, \mathop{*}_{i=1}^{n} \alpha_i$, where each $\alpha_i$ is an atom and $\mathbf{x}$ is a possibly empty vector of variables.


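Definition 1. A variable $x$ is allocated by a symbolic heap $\phi$ iff $\phi$ contains a sequence
of equalities $x_1 \approx x_2 \approx \ldots \approx x_{n-1} \approx x_n$, for $n \geq 1$, such that $x = x_1$ and $x_n \mapsto (y_1,\ldots,y_\kappa)$
occurs in $\phi$, for some variables $x_1,\ldots,x_n$ and some terms $y_1,\ldots,y_\kappa \in \mathbb{V} \cup \{\bot\}$.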
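The allocation test of Definition 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input encoding (a list of equality pairs plus the list of variables that are sources of points-to atoms) and the function name `allocated_vars` are hypothetical, and equalities are read symmetrically, which Definition 1 does not state explicitly.

```python
def allocated_vars(equalities, points_to_sources):
    """Variables allocated by a symbolic heap (sketch of Definition 1).

    equalities: iterable of (x, y) pairs, one per equational atom x ≈ y.
    points_to_sources: variables x occurring as the source of x ↦ (...).
    Assumption: equalities are treated as symmetric (union-find classes).
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for x, y in equalities:
        parent[find(x)] = find(y)

    # A variable is allocated iff its class contains a points-to source.
    roots = {find(x) for x in points_to_sources}
    return {v for v in list(parent) if find(v) in roots}
```

For instance, with the equalities x ≈ y and y ≈ z and a single points-to atom with source z, all of x, y and z are reported as allocated, whereas the fields of the points-to atom are not.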


A substitution is a partial function mapping variables to variables. If $\sigma$ is a substitution
and $\phi$ is a formula, a variable or a tuple, then $\phi\sigma$ denotes the formula, the variable or
the tuple obtained from $\phi$ by replacing every free occurrence of a variable $x \in \mathrm{dom}(\sigma)$
by $\sigma(x)$, respectively. We denote by $\{(x_i,y_i) \mid i \in [1,n]\}$ the substitution with domain
$\{x_1,\ldots,x_n\}$ that maps $x_i$ to $y_i$, for each $i \in [1,n]$.

A set of inductive definitions (SID) $\mathcal{R}$ is a finite set of implications (or rules) of the
form $p(x_1,\ldots,x_n) \Leftarrow \pi$, where $p \in \mathbb{P}$, $n = \mathrm{ar}(p)$, $x_1,\ldots,x_n$ are pairwise distinct variables
and $\pi$ is a quantifier-free symbolic heap. The predicate atom $p(x_1,\ldots,x_n)$ is the head of
the rule and $\mathcal{R}(p)$ denotes the subset of $\mathcal{R}$ consisting of rules with head $p(x_1,\ldots,x_n)$
(the choice of $x_1,\ldots,x_n$ is not important). The variables in $\mathrm{fv}(\pi) \setminus \{x_1,\ldots,x_n\}$ are called
the existential variables of the rule. Note that, by definition, these variables are not
explicitly quantified inside $\pi$ and that $\pi$ is quantifier-free. For simplicity, we denote by
$p(x_1,\ldots,x_n) \Leftarrow_{\mathcal{R}} \pi$ the fact that the rule $p(x_1,\ldots,x_n) \Leftarrow \pi$ belongs to $\mathcal{R}$. The size of $\mathcal{R}$ is
defined as $|\mathcal{R}| \stackrel{\text{def}}{=} \sum_{p(x_1,\ldots,x_n) \Leftarrow_{\mathcal{R}} \pi} (|\pi| + n)$ and its width as
$w(\mathcal{R}) \stackrel{\text{def}}{=} \max_{p(x_1,\ldots,x_n) \Leftarrow_{\mathcal{R}} \pi} (|\pi| + n)$.
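We write $p \succeq_{\mathcal{R}} q$, for $p, q \in \mathbb{P}$, iff $\mathcal{R}$ contains a rule of the form $p(x_1,\ldots,x_n) \Leftarrow \pi$, and $q$
occurs in $\pi$. We say that $p$ depends on $q$ if $p \succeq^*_{\mathcal{R}} q$. For a formula $\phi$, we denote by $\mathcal{P}(\phi)$
the set of predicate symbols $q$ such that $p \succeq^*_{\mathcal{R}} q$ for some predicate $p$ occurring in $\phi$.
Given formulæ $\phi$ and $\psi$, we write $\phi \Leftarrow_{\mathcal{R}} \psi$ if $\psi$ is obtained from $\phi$ by replacing an
atom $p(u_1,\ldots,u_n)$ by $\pi\{(x_1,u_1),\ldots,(x_n,u_n)\}$, where $\mathcal{R}$ contains a rule $p(x_1,\ldots,x_n) \Leftarrow \pi$.
We assume, by a renaming of existential variables, that the set $(\mathrm{fv}(\pi) \setminus \{x_1,\ldots,x_n\}) \cap
\mathrm{fv}(\phi)$ is empty. We call $\psi$ an unfolding of $\phi$ iff $\phi \Leftarrow^*_{\mathcal{R}} \psi$.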
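The set of predicate symbols on which the predicates of a formula depend is a plain reachability computation over the "occurs in a rule body" relation. The sketch below assumes a hypothetical encoding in which an SID is a dictionary mapping each predicate name to a list of rule bodies, each body being reduced to the list of predicate names occurring in it; neither the encoding nor the name `dependency_closure` comes from the paper.

```python
def dependency_closure(sid, start_preds):
    """Predicates reachable from start_preds via the dependency relation.

    sid: dict mapping a predicate name to a list of rule bodies, where a
         body is given as the list of predicate names occurring in it
         (hypothetical encoding).
    start_preds: predicate names occurring in the formula.
    The closure is reflexive and transitive, so start_preds are included.
    """
    seen = set(start_preds)
    stack = list(start_preds)
    while stack:
        p = stack.pop()
        for body_preds in sid.get(p, []):
            for q in body_preds:
                if q not in seen:
                    seen.add(q)
                    stack.append(q)
    return seen
```

Each predicate is pushed at most once, so the traversal is linear in the size of the encoded SID.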
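We now define the semantics of SL. Let $\mathcal{L}$ be a countably infinite set of locations
containing, in particular, a special location $\bot\!\!\!\bot$. A structure is a pair $(\mathfrak{s},\mathfrak{h})$, where:
- $\mathfrak{s}$ is a partial function from $\mathbb{V} \cup \{\bot\}$ to $\mathcal{L}$, called a store, such that $\bot \in \mathrm{dom}(\mathfrak{s})$ and
$\mathfrak{s}(x) = \bot\!\!\!\bot \iff x = \bot$, for all $x \in \mathbb{V} \cup \{\bot\}$, and
- $\mathfrak{h} : \mathcal{L} \to \mathcal{L}^\kappa$ is a finite partial function, such that $\bot\!\!\!\bot \notin \mathrm{dom}(\mathfrak{h})$.
If $x_1,\ldots,x_n$ are pairwise distinct variables and $\ell_1,\ldots,\ell_n \in \mathcal{L}$ are locations, we denote by
$\mathfrak{s}[x_i \leftarrow \ell_i \mid 1 \leq i \leq n]$ the store $\mathfrak{s}'$ defined by $\mathrm{dom}(\mathfrak{s}') = \mathrm{dom}(\mathfrak{s}) \cup \{x_1,\ldots,x_n\}$, $\mathfrak{s}'(y) = \ell_i$
if $y = x_i$ for some $i \in [1,n]$, and $\mathfrak{s}'(y) = \mathfrak{s}(y)$ otherwise. If $x_1,\ldots,x_n \notin \mathrm{dom}(\mathfrak{s})$, then the
store $\mathfrak{s}'$ is called an extension of $\mathfrak{s}$ to $\{x_1,\ldots,x_n\}$.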
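Given a heap $\mathfrak{h}$, we define $\mathrm{ref}(\mathfrak{h}) \stackrel{\text{def}}{=} \bigcup_{\ell \in \mathrm{dom}(\mathfrak{h})} \{\ell_i \mid \mathfrak{h}(\ell) = (\ell_1,\ldots,\ell_\kappa),\ i \in [1,\kappa]\}$ and
$\mathrm{loc}(\mathfrak{h}) \stackrel{\text{def}}{=} \mathrm{dom}(\mathfrak{h}) \cup \mathrm{ref}(\mathfrak{h})$. Two heaps $\mathfrak{h}_1$ and $\mathfrak{h}_2$ are disjoint iff $\mathrm{dom}(\mathfrak{h}_1) \cap \mathrm{dom}(\mathfrak{h}_2) = \emptyset$,
in which case $\mathfrak{h}_1 \uplus \mathfrak{h}_2$ denotes the union of $\mathfrak{h}_1$ and $\mathfrak{h}_2$, undefined whenever $\mathfrak{h}_1$ and $\mathfrak{h}_2$ are
not disjoint.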
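These heap operations translate directly into code. The following sketch assumes, for illustration only, that a heap is a Python dictionary from locations to tuples of locations; the function names mirror the notation of the text but are otherwise hypothetical.

```python
def ref(heap):
    """Locations referenced by some cell of the heap."""
    return {l for cell in heap.values() for l in cell}


def loc(heap):
    """Allocated and referenced locations of the heap."""
    return set(heap) | ref(heap)


def disjoint_union(h1, h2):
    """Union of two heaps, or None when their domains overlap (undefined)."""
    if set(h1) & set(h2):
        return None
    return {**h1, **h2}
```

Returning `None` for overlapping domains is a design choice of this sketch, modelling the fact that the union is undefined in that case.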

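Given an SID $\mathcal{R}$, $(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \phi$ is the least relation between structures and formulæ
such that whenever $(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \phi$, we have $\mathrm{fv}(\phi) \subseteq \mathrm{dom}(\mathfrak{s})$ and the following hold:

$(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} x \approx x'$ if $\mathrm{dom}(\mathfrak{h}) = \emptyset$ and $\mathfrak{s}(x) = \mathfrak{s}(x')$
$(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} x \not\approx x'$ if $\mathrm{dom}(\mathfrak{h}) = \emptyset$ and $\mathfrak{s}(x) \neq \mathfrak{s}(x')$
$(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} x \mapsto (y_1,\ldots,y_\kappa)$ if $\mathrm{dom}(\mathfrak{h}) = \{\mathfrak{s}(x)\}$ and $\mathfrak{h}(\mathfrak{s}(x)) = (\mathfrak{s}(y_1),\ldots,\mathfrak{s}(y_\kappa))$
$(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \phi_1 * \phi_2$ if there exist disjoint heaps $\mathfrak{h}_1$ and $\mathfrak{h}_2$ such that $\mathfrak{h} = \mathfrak{h}_1 \uplus \mathfrak{h}_2$ and $(\mathfrak{s},\mathfrak{h}_i) \models_{\mathcal{R}} \phi_i$, for both $i = 1,2$
$(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \phi_1 \vee \phi_2$ if $(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \phi_i$, for some $i = 1,2$
$(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \exists x \,.\, \phi$ if there exists $\ell \in \mathcal{L}$ such that $(\mathfrak{s}[x \leftarrow \ell],\mathfrak{h}) \models_{\mathcal{R}} \phi$
$(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} p(x_1,\ldots,x_n)$ if $p(x_1,\ldots,x_n) \Leftarrow_{\mathcal{R}} \phi$, and there exists a store $\mathfrak{s}_e$ coinciding with $\mathfrak{s}$ on $\{x_1,\ldots,x_n\}$, such that $(\mathfrak{s}_e,\mathfrak{h}) \models_{\mathcal{R}} \phi$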
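For the simplest fragment, quantifier-free and predicate-less symbolic heaps, the clauses above boil down to a direct check: every (dis)equality must hold in the store, and the heap must consist exactly of the cells described by the points-to atoms, each allocated once. The sketch below is an illustration only; the atom encoding (`"eq"`, `"neq"`, `"pto"` tuples), the use of `None` for the constant ⊥ and its location, and the name `satisfies` are all assumptions of this example.

```python
BOT = None  # stands for the constant ⊥ and for the location it denotes


def satisfies(store, heap, atoms):
    """Check (store, heap) against a separating conjunction of atoms.

    atoms: list of ("eq", x, y), ("neq", x, y) or ("pto", x, (y1, ..., yk)),
    a hypothetical encoding of a quantifier-free, predicate-less symbolic heap.
    store: dict from variable names to locations; heap: dict location -> tuple.
    """
    def val(term):
        return BOT if term is BOT else store[term]

    expected = {}
    for atom in atoms:
        kind = atom[0]
        if kind == "eq":
            if val(atom[1]) != val(atom[2]):
                return False
        elif kind == "neq":
            if val(atom[1]) == val(atom[2]):
                return False
        elif kind == "pto":
            src = val(atom[1])
            if src in expected:  # separating conjunction needs disjoint cells
                return False
            expected[src] = tuple(val(y) for y in atom[2])
        else:
            raise ValueError(f"unknown atom kind: {kind}")
    # Equational atoms hold on the empty heap, so the whole heap must be
    # exactly the union of the cells described by the points-to atoms.
    return expected == heap
```

Predicate atoms and quantifiers are deliberately left out: handling them requires unfolding rules and guessing witnesses, which is where the complexity of the entailment problem comes from.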


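Given formulæ $\phi$ and $\psi$, we write $\phi \models_{\mathcal{R}} \psi$ whenever $(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \phi \Rightarrow (\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \psi$,
for all structures $(\mathfrak{s},\mathfrak{h})$, and $\phi \equiv_{\mathcal{R}} \psi$ for ($\phi \models_{\mathcal{R}} \psi$ and $\psi \models_{\mathcal{R}} \phi$). We omit the subscript
$\mathcal{R}$ whenever these relations hold for any SID. It is easy to check that, for all formulæ
$\phi_1, \phi_2, \psi$, it is the case that $(\phi_1 \vee \phi_2) * \psi \equiv (\phi_1 * \psi) \vee (\phi_2 * \psi)$ and $(\exists x \,.\, \phi_1) * \phi_2 \equiv \exists x \,.\, \phi_1 *
\phi_2$. Consequently, each formula can be transformed into an equivalent finite disjunction
of symbolic heaps.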
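Definition 2. An entailment problem is a triple $\mathfrak{P} \stackrel{\text{def}}{=} \phi \vdash_{\mathcal{R}} \psi$, where $\phi$ is a quantifier-free
formula, $\psi$ is a formula and $\mathcal{R}$ is an SID. The problem is valid iff $\phi \models_{\mathcal{R}} \psi$. The
size of the problem $\mathfrak{P}$ is defined as $|\mathfrak{P}| \stackrel{\text{def}}{=} |\phi| + |\psi| + |\mathcal{R}|$ and its width is defined as
$w(\mathfrak{P}) \stackrel{\text{def}}{=} \max(|\phi|, |\psi|, w(\mathcal{R}))$.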


Note that considering $\phi$ to be quantifier-free loses no generality, because $\exists x \,.\, \phi \models_{\mathcal{R}}
\psi \iff \phi \models_{\mathcal{R}} \psi$.


3 Decidable Entailment Problems 


The class of general entailment problems is undecidable, see Theorem 5 below for a 
refinement of the initial undecidability proofs [11,1]. A first attempt to define a natural 
decidable class of entailment problems is described in [10] and involves three restric- 
tions on the SID rules, formally defined below: 


Definition 3. A rule $p(x_1,\ldots,x_n) \Leftarrow \pi$ is:
1. progressing (P) iff $\pi = x_1 \mapsto (y_1,\ldots,y_\kappa) * \rho$ and $\rho$ contains no points-to atoms,
2. connected (C) iff it is progressing, $\pi = x_1 \mapsto (y_1,\ldots,y_\kappa) * \rho$ and every predicate
atom in $\rho$ is of the form $q(y_i,\mathbf{u})$, for some $i \in [1,\kappa]$,
3. established (E) iff every existential variable $x \in \mathrm{fv}(\pi) \setminus \{x_1,\ldots,x_n\}$ is allocated by
every predicate-less unfolding $\pi \Leftarrow^*_{\mathcal{R}} \phi$.
An SID $\mathcal{R}$ is P (resp. C, E) for a formula $\phi$ iff every rule in $\bigcup_{p \in \mathcal{P}(\phi)} \mathcal{R}(p)$ is P (resp.
C, E). An entailment problem $\phi \vdash_{\mathcal{R}} \psi$ is left- (resp. right-) P (resp. C, E) iff $\mathcal{R}$ is P (resp.
C, E) for $\phi$ (resp. $\psi$). An entailment problem is P (resp. C, E) iff it is both left- and
right-P (resp. C, E).


The decidability of progressing, connected and left-established entailment problems is 
an immediate consequence of the result of [10]. Moreover, an analysis of the proof 
[10] leads to an elementary recursive complexity upper bound, which has been recently
tightened to 2EXPTIME-complete [14,8,6]. In the following, we refer to Table 1
for a recap of the complexity results for the entailment problem. The last line is the 
main result of the paper and corresponds to the most general (known) decidable class 
of entailment problems (Definition 8). 


Table 1. Decidability and Complexity Results for the Entailment Problem (✓ means that the
corresponding condition holds on the left- and right-hand side of the entailment)

| Reference       | Progress | Connected | Established | Restricted | Complexity |
|-----------------|----------|-----------|-------------|------------|------------|
| Theorem 4       | ✓        | ✓         | left        | -          | 2EXP-co.   |
| Theorem 5       | ✓        | left      | ✓           | -          | undec.     |
| [7, Theorem 6]  | ✓        | ✓         | -           | -          | undec.     |
| [8, Theorem 32] | ✓        | ✓         | -           | ✓          | 2EXP-co.   |
| Theorem 31      | ✓        | right     | -           | right      | 2EXP-co.   |


The following theorem is an easy consequence of previous results [6]. 
Theorem 4. The progressing, connected and left-established entailment problem is
2EXPTIME-complete. Moreover, there exists a decision procedure that runs in time
$2^{2^{\mathcal{O}(w(\mathfrak{P})^8 \cdot \log |\mathfrak{P}|)}}$ for every instance $\mathfrak{P}$ of this problem.

A natural question arises in this context: which of the restrictions from the above 
theorem can be relaxed and what is the price, in terms of computational complexity, of 
relaxing (some of) them? In the light of Theorem 5 below, the connectivity restriction 
cannot be completely dropped. Further, if we drop the establishment condition, the 
problem becomes undecidable [7, Theorem 6], even if both the left/right progress and 
connectivity conditions apply. 


Theorem 5. The progressing, left-connected and established entailment problem is un- 
decidable. 


The second decidable class of entailment problems [8] relaxes the connectivity condition
and replaces the establishment with a syntactic condition (that can be checked
in polynomial time in the size of the SID), while remaining 2EXPTIME-complete. Informally,
the definition forbids (dis)equations between existential variables in symbolic
heaps or rules: the only allowed (dis)equations are of the form $x \not\approx y$ where $x$ is a free
variable (viewed as a constant in [8]). The definition given below is essentially equivalent
to that of [8], but avoids any reference to constants; instead it uses a notion of
$\mathcal{R}$-positional functions, which helps to identify existential variables that are always replaced
by a free variable from the initial formula during unfolding.

An $\mathcal{R}$-positional function maps every $n$-ary predicate symbol $p$ occurring in $\mathcal{R}$ to a
subset of $[1,n]$. Given an $\mathcal{R}$-positional function $\lambda$ and a formula $\phi$, we denote by $V_\lambda(\phi)$
the set of variables $x_i$ such that $\phi$ contains a predicate atom $p(x_1,\ldots,x_n)$ with $i \in \lambda(p)$.
Note that $V_\lambda$ is stable under substitutions, i.e. $V_\lambda(\phi\sigma) = (V_\lambda(\phi))\sigma$, for each formula $\phi$
and each substitution $\sigma$.


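Definition 6. Let $\psi$ be a formula and $\mathcal{R}$ be an SID. The fv-profile of the pair $(\psi, \mathcal{R})$
is the $\mathcal{R}$-positional function $\lambda$ such that the sets $\lambda(p)$, for $p \in \mathbb{P}$, are the maximal sets
satisfying the following conditions:
1. $V_\lambda(\psi) \subseteq \mathrm{fv}(\psi)$.
2. For all predicate symbols $p \in \mathcal{P}(\psi)$, all rules $p(x_1,\ldots,x_n) \Leftarrow \pi$ in $\mathcal{R}$, all predicate
atoms $q(y_1,\ldots,y_m)$ in $\pi$ and all $i \in \lambda(q)$, there exists $j \in \lambda(p)$ such that $x_j = y_i$.
The fv-profile of $(\psi, \mathcal{R})$ is denoted by $\lambda^{\mathcal{R}}_{\psi}$.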


Intuitively, given a predicate $p \in \mathbb{P}$, the set $\lambda^{\mathcal{R}}_{\psi}(p)$ denotes the formal parameters of $p$
that, in every unfolding of $\psi$, will always be substituted by variables occurring freely
in $\psi$. It is easy to check that $\lambda^{\mathcal{R}}_{\psi}$ can be computed in polynomial time w.r.t. $|\psi| + |\mathcal{R}|$,
using a straightforward greatest fixpoint algorithm. The algorithm starts with a function
mapping every predicate $p$ of arity $n$ to $[1,n]$ and repeatedly removes elements from
the sets $\lambda(p)$ to ensure that the above conditions hold. In the worst case, we may
eventually have $\lambda(p) = \emptyset$ for all predicate symbols $p$.
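The greatest fixpoint computation just described can be sketched as follows. The encoding is an assumption made for this illustration: the formula is given by its predicate atoms and its set of free variables, and the SID by a dictionary from predicate names to lists of (formal parameters, body predicate atoms) pairs, restricted to the predicates the formula depends on, each of which is assumed to have at least one rule. The function name `fv_profile` is hypothetical.

```python
def fv_profile(psi_atoms, psi_free, sid):
    """Greatest-fixpoint computation of an fv-profile (sketch).

    psi_atoms: list of (pred, args) predicate atoms occurring in the formula.
    psi_free:  set of free variables of the formula.
    sid: dict pred -> list of (params, body_atoms), where body_atoms is a
         list of (pred, args); assumed to cover every predicate involved.
    Positions are 1-based, as in the text.
    """
    lam = {p: set(range(1, len(rules[0][0]) + 1)) for p, rules in sid.items()}

    # Condition 1: marked positions of atoms of the formula hold free variables.
    for p, args in psi_atoms:
        lam[p] = {i for i in lam[p] if args[i - 1] in psi_free}

    # Condition 2: marked positions of body atoms are fed by marked parameters.
    changed = True
    while changed:
        changed = False
        for p, rules in sid.items():
            for params, body in rules:
                allowed = {params[j - 1] for j in lam[p]}
                for q, args in body:
                    keep = {i for i in lam[q] if args[i - 1] in allowed}
                    if keep != lam[q]:
                        lam[q] = keep
                        changed = True
    return lam
```

Since positions are only ever removed and there are at most as many positions as the sum of the arities, the loop terminates after polynomially many passes, in line with the complexity claim above.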


Definition 7. Let $\lambda$ be an $\mathcal{R}$-positional function, and $V$ be a set of variables. A formula
$\phi$ is $\lambda$-restricted ($\lambda$-R) w.r.t. $V$ iff the following hold:
1. for every disequation $y \not\approx z$ in $\phi$, we have $\{y,z\} \cap V \neq \emptyset$, and
2. $V_\lambda(\phi) \subseteq V$.
A rule $p(x_1,\ldots,x_n) \Leftarrow x_1 \mapsto (y_1,\ldots,y_\kappa) * \rho$ is:
- $\lambda$-connected ($\lambda$-C) iff for every atom $q(z_1,\ldots,z_m)$ occurring in $\rho$, we have $z_1 \in
V_\lambda(p(x_1,\ldots,x_n)) \cup \{y_1,\ldots,y_\kappa\}$,
- $\lambda$-restricted ($\lambda$-R) iff $\rho$ is $\lambda$-restricted w.r.t. $V_\lambda(p(x_1,\ldots,x_n))$.
An SID $\mathcal{R}$ is P (resp. $\lambda$-C, $\lambda$-R) for a formula $\phi$ iff every rule in $\bigcup_{p \in \mathcal{P}(\phi)} \mathcal{R}(p)$ is P
(resp. $\lambda$-C, $\lambda$-R).
An entailment problem $\phi \vdash_{\mathcal{R}} \psi$ is left- (right-) $\lambda$-C ($\lambda$-R) iff $\mathcal{R}$ is $\lambda$-C ($\lambda$-R) for $\phi$ ($\psi$),
where $\lambda$ is considered to be $\lambda^{\mathcal{R}}_{\phi}$ ($\lambda^{\mathcal{R}}_{\psi}$). An entailment problem is $\lambda$-C ($\lambda$-R) iff it is both
left- and right-$\lambda$-C ($\lambda$-R).


The class of progressing, $\lambda$-connected and $\lambda$-restricted entailment problems has been
shown to be a generalization of the class of progressing, connected and left-established
problems, because the latter can be reduced to the former by a many-one reduction [8,
Theorem 13] that runs in time $|\mathfrak{P}| \cdot 2^{\mathcal{O}(w(\mathfrak{P})^2)}$ on input $\mathfrak{P}$ (Figure 1) and preserves the
problem’s width asymptotically.


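[Figure 1: diagram relating three classes of entailment problems: (i) progressing, right $\lambda$-connected and right $\lambda$-restricted (safe); (ii) progressing, connected and left-established; (iii) progressing, $\lambda$-connected and $\lambda$-restricted. Arrows denote many-one reductions, annotated with their running time; the dashed arrow goes from the safe class to the progressing, connected and left-established class.]

Fig. 1. Many-one Reductions between Decidable Entailment Problems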

In the rest of this paper we close the loop by defining a syntactic extension of
progressing, $\lambda$-connected and $\lambda$-restricted entailment problems and by showing that
this extension can be reduced to the class of progressing, connected and left-established
entailment problems by a many-one reduction. The new fragment is defined as follows:


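Definition 8. An entailment problem $\phi \vdash_{\mathcal{R}} \psi$ is safe if, for $\lambda \stackrel{\text{def}}{=} \lambda^{\mathcal{R}}_{\psi}$, the following hold:
1. every rule in $\mathcal{R}$ is progressing,
2. $\psi$ is $\lambda$-restricted w.r.t. $\mathrm{fv}(\phi)$,
3. all the rules from $\bigcup_{p \in \mathcal{P}(\psi)} \mathcal{R}(p)$ are $\lambda$-connected and $\lambda$-restricted.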


Note that there is no condition on the formula $\phi$, or on the rules defining the predicates
occurring only in $\phi$, other than the progress condition. The conditions in Definition
8 ensure that all the disequations occurring in any unfolding of $\psi$ involve at least one
variable that is free in $\phi$. Further, the heaps of the models of $\psi$ must be forests, i.e. unions
of trees, the roots of which are associated with the first argument of the predicate atoms
in $\psi$ or with free variables from $\phi$.

A typical yet very simple example of such an entailment is the so-called “reversed
list” problem that consists in checking that any list segment $\mathrm{revls}(z,y)$ defined in the
reverse direction (from the tail to the head) is a list segment $\mathrm{ls}(x,y)$ in the usual sense
(defined inductively from head to tail). This corresponds to the entailment problem
$\mathrm{revls}(z,y) \vdash_{\mathcal{R}} \exists x \,.\, \mathrm{ls}(x,y)$ where $\mathcal{R}$ contains the following rules:

$\mathrm{ls}(x,y) \Leftarrow x \mapsto (y)$                              $\mathrm{revls}(z,y) \Leftarrow z \mapsto (y)$
$\mathrm{ls}(x,y) \Leftarrow x \mapsto (z) * \mathrm{ls}(z,y)$          $\mathrm{revls}(z,y) \Leftarrow z \mapsto (y) * \mathrm{revls}(u,z)$


This problem is considered challenging for proof search-based automated reasoning
procedures (see, e.g., [4,16]). The antecedent does not fulfill the connectivity condition,
but the consequent does, hence the entailment is safe. Similar, more complex examples
can be defined; for instance, a list can be constructed by interleaving elements at odd or
even positions. Another example is the case of a data structure containing an unbounded
number of acyclic lists (e.g., a list of acyclic lists). Such a data structure does not fulfill
the restrictedness condition, since one needs to compare the pointers occurring along
each list to the pointer at the end. Checking, for instance, that the concatenation of two
lists of acyclic lists is again a list of (possibly cyclic) lists is a problem that fits into the
safe class and can thus be effectively checked by our algorithm.
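The claim that the antecedent of the reversed list problem is progressing but not connected, whereas the consequent is connected, can be checked mechanically. The sketch below uses a hypothetical encoding of a rule as a triple (formal parameters, points-to atoms, predicate atoms); equational atoms are ignored since they play no role in these two conditions.

```python
def is_progressing(params, points_to):
    """Exactly one points-to atom, and it allocates the first parameter."""
    return len(points_to) == 1 and points_to[0][0] == params[0]


def is_connected(params, points_to, pred_atoms):
    """Progressing, and every predicate atom starts at an allocated field."""
    if not is_progressing(params, points_to):
        return False
    fields = points_to[0][1]
    return all(len(args) > 0 and args[0] in fields for _, args in pred_atoms)


# Inductive rules of the example, in the hypothetical encoding
# (params, [(source, fields)], [(predicate, args)]).
LS_STEP = (("x", "y"), [("x", ("z",))], [("ls", ("z", "y"))])
REVLS_STEP = (("z", "y"), [("z", ("y",))], [("revls", ("u", "z"))])
```

In the inductive rule for revls, the recursive call starts at the existential variable u, which is not a field of the allocated cell, so the connectivity check fails; in the rule for ls the recursive call starts at the field z.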

We refer the reader to Figure 1 for a general picture of the entailment problems 
considered so far and of the many-one reductions between them, where the reduction 
corresponding to the dashed arrow is the concern of the next section. Importantly, since 
all reductions are many-one, taking time polynomial in the size and exponential in the 
width of the input problem, while preserving its width asymptotically, the three classes
from Figure 1 can be unified into a single (2EXPTIME-complete) class of entailments.


4 Reducing Safe to Established Entailments 


In a model of a safe SID (Definition 8), the existential variables introduced by the 
replacement of predicate atoms with corresponding rule bodies are not required to be 
allocated. This is because safe SIDs are more liberal than established SIDs and allow 
heap structures with an unbounded number of dangling pointers. As observed in [8], 
checking the validity of an entailment (w.r.t. a restricted SID) can be done by considering
only those structures in which the dangling pointers point to pairwise distinct locations.
The main idea of the reduction of safe to established entailment problems presented here is that
any such structure can be extended by allocating all dangling pointers separately and,
moreover, the extended structures can be defined by an established SID.

In what follows, we fix an arbitrary instance $\mathfrak{P} = \phi \vdash_{\mathcal{R}} \psi$ of the safe entailment
problem (Definition 8) and denote by $\lambda \stackrel{\text{def}}{=} \lambda^{\mathcal{R}}_{\psi}$ the fv-profile of $(\psi, \mathcal{R})$ (Definition
6). Let $\mathbf{w} \stackrel{\text{def}}{=} (w_1,\ldots,w_\nu)$ be the vector of free variables from $\phi$ and $\psi$, where the order
of variables is not important, and assume w.l.o.g. that $\nu > 0$. Let $\mathcal{P}_l \stackrel{\text{def}}{=} \mathcal{P}(\phi)$ and
$\mathcal{P}_r \stackrel{\text{def}}{=} \mathcal{P}(\psi)$ be the sets of predicate symbols on which the predicate symbols occurring
in the left- and right-hand side of the entailment, respectively, depend. We assume that
$\phi$ and $\psi$ contain no points-to atoms and that $\mathcal{P}_l \cap \mathcal{P}_r = \emptyset$. Again, these assumptions lose
no generality, because a points-to atom $u \mapsto (v_1,\ldots,v_\kappa)$ can be replaced by a predicate
atom $p(u,v_1,\ldots,v_\kappa)$, where $p$ is a fresh predicate symbol associated with the rule
$p(x,y_1,\ldots,y_\kappa) \Leftarrow x \mapsto (y_1,\ldots,y_\kappa)$. Moreover, the condition $\mathcal{P}_l \cap \mathcal{P}_r = \emptyset$ may be enforced
by considering two copies of each predicate, for the left-hand side and for the right-hand
side, respectively. Finally, we assume that every rule contains exactly $\mu$ existential variables,
for some fixed $\mu \in \mathbb{N}$; this condition can be enforced by adding dummy literals
$x \approx x$ if needed.

We describe a reduction of $\mathfrak{P}$ to an equivalent progressing, connected, and left-established
entailment problem. The reduction will extend heaps, by adding $\nu + \mu$ record
fields. We shall therefore often consider heaps and points-to atoms having $\kappa + \nu + \mu$
record fields, where the formal definitions are similar to those given previously. Usually
such formulæ and heaps will be written with a prime. These additional record fields
will be used to ensure that the constructed system is connected, by adding all the existential
variables of a given rule (as well as the variables in $w_1,\ldots,w_\nu$) into the image of
the location allocated by the considered rule. Furthermore, the left-establishment condition
will be enforced by adding predicates and rules in order to allocate all the locations
that correspond to existential quantifiers and that are not already allocated, making such
locations point to a dummy vector $\boldsymbol{\bot} \stackrel{\text{def}}{=} (\bot,\ldots,\bot)$, of length $\kappa + \nu + \mu$, where $\bot$ is the
special constant denoting empty heap entries. To this aim, we shall use a predicate symbol
$\underline{\bot}$ associated with the rule $\underline{\bot}(x) \Leftarrow x \mapsto \boldsymbol{\bot}$. Note that allocating all these locations
will entail (by definition of the separating conjunction) that they are distinct, thus the
addition of such predicates and rules will reduce the number of satisfiable unfoldings.
However, due to the restrictions on the use of disequations³, we shall see that this does
not change the status of the entailment problem.


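Definition 9. For any total function $\gamma : \mathcal{L} \to \mathcal{L}$ and any tuple $\boldsymbol{\ell} = (\ell_1,\ldots,\ell_n) \in \mathcal{L}^n$, we
denote by $\gamma(\boldsymbol{\ell})$ the tuple $(\gamma(\ell_1),\ldots,\gamma(\ell_n))$. If $\mathfrak{s}$ is a store, then $\gamma(\mathfrak{s})$ denotes the store
with domain $\mathrm{dom}(\mathfrak{s})$, such that $\gamma(\mathfrak{s})(x) \stackrel{\text{def}}{=} \gamma(\mathfrak{s}(x))$, for all $x \in \mathrm{dom}(\mathfrak{s})$. Consider a heap
$\mathfrak{h}$ such that for all $\ell \neq \ell' \in \mathrm{dom}(\mathfrak{h})$, we have $\gamma(\ell) \neq \gamma(\ell')$. Then $\gamma(\mathfrak{h})$ denotes the heap
with domain $\mathrm{dom}(\gamma(\mathfrak{h})) = \{\gamma(\ell) \mid \ell \in \mathrm{dom}(\mathfrak{h})\}$, such that $\gamma(\mathfrak{h})(\gamma(\ell)) \stackrel{\text{def}}{=} \gamma(\mathfrak{h}(\ell))$, for all
$\ell \in \mathrm{dom}(\mathfrak{h})$.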
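The three liftings of Definition 9 can be sketched as below, assuming (for illustration only) that stores and heaps are Python dictionaries and that the mapping on locations is given as a callable. The function names are hypothetical. The injectivity requirement on the domain of the heap is enforced explicitly, since otherwise two cells would collide in the image.

```python
def apply_tuple(gamma, cell):
    """Image of a tuple of locations, componentwise."""
    return tuple(gamma(l) for l in cell)


def apply_store(gamma, store):
    """Image of a store: same domain, values mapped through gamma."""
    return {x: gamma(l) for x, l in store.items()}


def apply_heap(gamma, heap):
    """Image of a heap; gamma must be injective on the domain of the heap."""
    image = {}
    for l, cell in heap.items():
        target = gamma(l)
        if target in image:
            raise ValueError("gamma is not injective on dom(heap)")
        image[target] = apply_tuple(gamma, cell)
    return image
```

Note that the mapping may still identify referenced but unallocated locations; only the allocated ones must be kept apart.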


The following lemma identifies conditions ensuring that the application of a map- 
ping to a structure (Definition 9) preserves the truth value of a formula. 


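Lemma 10. Given a set of variables $V$, let $\alpha$ be a formula that is $\lambda$-restricted w.r.t.
$V$, such that $\mathcal{P}(\alpha) \subseteq \mathcal{P}_r$, and let $(\mathfrak{s},\mathfrak{h})$ be an $\mathcal{R}$-model of $\alpha$. For every mapping $\gamma :
\mathcal{L} \to \mathcal{L}$ such that $\gamma(\ell) = \gamma(\ell') \Rightarrow \ell = \ell'$ holds whenever either $\{\ell,\ell'\} \subseteq \mathrm{dom}(\mathfrak{h})$ or
$\{\ell,\ell'\} \cap \mathfrak{s}(V) \neq \emptyset$, we have $(\gamma(\mathfrak{s}),\gamma(\mathfrak{h})) \models_{\mathcal{R}} \alpha$.


If $\gamma$ is, moreover, injective, then the result of Lemma 10 holds for any formula:


Lemma 11. Let $\alpha$ be a formula and let $(\mathfrak{s},\mathfrak{h})$ be an $\mathcal{R}$-model of $\alpha$. For every injective
mapping $\gamma : \mathcal{L} \to \mathcal{L}$ we have $(\gamma(\mathfrak{s}),\gamma(\mathfrak{h})) \models_{\mathcal{R}} \alpha$.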
³ Point (1) of Definition 7 in conjunction with point (2) of Definition 8.
Fig. 2. Heap Expansion and Truncation (diagram not reproduced)


4.1 Expansions and Truncations 


We introduce a so-called expansion relation on structures, as well as a truncation operation
on heaps. Intuitively, the expansion of a structure is a structure with the same
store and whose heap is augmented with new allocated locations (each pointing to $\boldsymbol{\bot\!\!\!\bot} = (\bot\!\!\!\bot,\ldots,\bot\!\!\!\bot)$)
and additional record fields, referring in particular to all the newly added allocated locations.
These locations are introduced to accommodate all the existential variables
of the predicate-less unfolding of the left-hand side of the entailment (to ensure that
the obtained entailment is left-established). Conversely, the truncation of a heap is the
heap obtained by removing these extra locations. We also introduce the notion of a
$\gamma$-expansion, which is a structure whose image by $\gamma$ is an expansion.

We recall that, throughout this and the next sections, $\mathbf{w} = (w_1,\ldots,w_\nu)$ denotes the
vector of free variables occurring in the problem, which is assumed to be fixed,
and that $\{w_1,\ldots,w_\nu,\bot\} \subseteq \mathrm{dom}(\mathfrak{s})$, for every store $\mathfrak{s}$ considered here.
Moreover, we assume w.l.o.g. that $w_1,\ldots,w_\nu$ do not occur in the considered SID $\mathcal{R}$, and
denote by $\mu$ the number of existential variables in each rule of $\mathcal{R}$. We refer to Figure 2
for an illustration of the definition below:


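Definition 12. Let $\gamma : \mathcal{L} \to \mathcal{L}$ be a total mapping. A structure $(\mathfrak{s},\mathfrak{h}')$ is a $\gamma$-expansion
(or simply an expansion if $\gamma = \mathrm{id}$) of some structure $(\mathfrak{s},\mathfrak{h})$, denoted by $(\mathfrak{s},\mathfrak{h}') \triangleright_\gamma (\mathfrak{s},\mathfrak{h})$, if
$\mathfrak{h} : \mathcal{L} \to \mathcal{L}^{\kappa}$, $\mathfrak{h}' : \mathcal{L} \to \mathcal{L}^{\kappa+\nu+\mu}$ and there exist two disjoint heaps, $\mathrm{main}(\mathfrak{h}')$ and $\mathrm{aux}(\mathfrak{h}')$,
such that $\mathfrak{h}' = \mathrm{main}(\mathfrak{h}') \uplus \mathrm{aux}(\mathfrak{h}')$ and the following hold:
1. for all $\ell_1, \ell_2 \in \mathrm{dom}(\mathrm{main}(\mathfrak{h}'))$, if $\gamma(\ell_1) = \gamma(\ell_2)$ then $\ell_1 = \ell_2$,
2. $\gamma(\mathrm{dom}(\mathrm{main}(\mathfrak{h}'))) = \mathrm{dom}(\mathfrak{h})$,
3. for each $\ell \in \mathrm{dom}(\mathrm{main}(\mathfrak{h}'))$, we have $\mathfrak{h}'(\ell) = (\mathbf{a}, \mathfrak{s}(\mathbf{w}), b^{\ell}_1,\ldots,b^{\ell}_\mu)$, for some locations
$b^{\ell}_1,\ldots,b^{\ell}_\mu \in \mathcal{L}$, and $\gamma(\mathbf{a}) = \mathfrak{h}(\gamma(\ell))$,
4. for each $\ell \in \mathrm{dom}(\mathrm{aux}(\mathfrak{h}'))$, we have $\mathfrak{h}'(\ell) = \boldsymbol{\bot\!\!\!\bot}$ and there exists a location $\ell' \in
\mathrm{dom}(\mathrm{main}(\mathfrak{h}'))$ such that $\mathrm{main}(\mathfrak{h}')(\ell')$ is of the form $(\mathbf{a}, \boldsymbol{\ell}, b^{\ell'}_1,\ldots,b^{\ell'}_\mu)$ where $\boldsymbol{\ell}$ is
a tuple of locations and $\ell = b^{\ell'}_i$, for some $i \in [1,\mu]$. The element $\ell'$ is called the
connection of $\ell$ in $\mathfrak{h}'$ and is denoted by $C_{\mathfrak{h}'}(\ell)$.⁴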


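Let $(\mathfrak{s},\mathfrak{h}')$ be a $\gamma$-expansion of $(\mathfrak{s},\mathfrak{h})$ and let $\ell \in \mathrm{dom}(\mathrm{main}(\mathfrak{h}'))$ be a location. Since $\nu > 0$
and for all $i \in [1,\nu]$, $\mathfrak{s}(w_i)$ occurs in $\mathfrak{h}'(\ell)$, and since we assume that $\mathfrak{s}(w_i) \neq \bot\!\!\!\bot = \mathfrak{s}(\bot)$
for every $i \in [1,\nu]$, necessarily $\mathrm{main}(\mathfrak{h}')(\ell) \neq \boldsymbol{\bot\!\!\!\bot}$. This entails that the decomposition
$\mathfrak{h}' = \mathrm{main}(\mathfrak{h}') \uplus \mathrm{aux}(\mathfrak{h}')$ is unique: $\mathrm{main}(\mathfrak{h}')$ and $\mathrm{aux}(\mathfrak{h}')$ are the restrictions of $\mathfrak{h}'$ to the
locations $\ell$ in $\mathrm{dom}(\mathfrak{h}')$ such that $\mathfrak{h}'(\ell) \neq \boldsymbol{\bot\!\!\!\bot}$ and $\mathfrak{h}'(\ell) = \boldsymbol{\bot\!\!\!\bot}$, respectively. In the following,
we shall thus freely use the notations $\mathrm{aux}(\mathfrak{h}')$ and $\mathrm{main}(\mathfrak{h}')$, for arbitrary heaps $\mathfrak{h}'$.

⁴ Note that $\ell'$ does not depend on $\gamma$, and if several such locations exist, then one is chosen
arbitrarily.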


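Definition 13. Given a heap $\mathfrak{h}'$, we denote by $\mathrm{trunc}(\mathfrak{h}')$ the heap $\mathfrak{h}$ defined as follows:
$\mathrm{dom}(\mathfrak{h}) \stackrel{\text{def}}{=} \mathrm{dom}(\mathfrak{h}') \setminus \{\ell \in \mathrm{dom}(\mathfrak{h}') \mid \mathfrak{h}'(\ell) = \boldsymbol{\bot\!\!\!\bot}\}$ and for all $\ell \in \mathrm{dom}(\mathfrak{h})$, if $\mathfrak{h}'(\ell) =
(\ell_1,\ldots,\ell_{\kappa+\nu+\mu})$ then $\mathfrak{h}(\ell) \stackrel{\text{def}}{=} (\ell_1,\ldots,\ell_\kappa)$.

Note that, if $\mathfrak{h} = \mathrm{trunc}(\mathfrak{h}')$ then $\mathfrak{h} : \mathcal{L} \to \mathcal{L}^{\kappa}$ and $\mathfrak{h}' : \mathcal{L} \to \mathcal{L}^{\kappa+\nu+\mu}$ are heaps of different
out-degrees. In the following, we silently assume this fact, to avoid cluttering the
notation by explicitly specifying the out-degree of a heap.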


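Example 14. Assume that $\mathcal{L} = \mathbb{N}$, $\nu = \mu = 1$. Let $\mathfrak{s}$ be a store such that $\mathfrak{s}(w_1) = 0$. We
consider:

$\mathfrak{h} = \{(1,2), (2,2)\}$,
$\mathfrak{h}'_1 = \{(1,(2,0,1)), (2,(2,0,3)), (3,(\bot\!\!\!\bot,\bot\!\!\!\bot,\bot\!\!\!\bot))\}$,
$\mathfrak{h}'_2 = \{(1,(3,0,1)), (2,(4,0,3)), (3,(\bot\!\!\!\bot,\bot\!\!\!\bot,\bot\!\!\!\bot))\}$.

We have $(\mathfrak{s},\mathfrak{h}'_1) \triangleright_{\mathrm{id}} (\mathfrak{s},\mathfrak{h})$ and $(\mathfrak{s},\mathfrak{h}'_2) \triangleright_\gamma (\mathfrak{s},\mathfrak{h})$, with $\gamma = \{(1,1), (2,2), (3,2), (4,2)\}$. Also,
$\mathrm{trunc}(\mathfrak{h}'_1) = \{(1,2), (2,2)\} = \mathfrak{h}$ and $\mathrm{trunc}(\mathfrak{h}'_2) = \{(1,3), (2,4)\}$. Note that $\mathfrak{h}$ has out-degree
$\kappa = 1$, whereas $\mathfrak{h}'_1$ and $\mathfrak{h}'_2$ have out-degree 3.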
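The truncations in this example can be replayed with a short script. The encoding is an assumption of the sketch: heaps are dictionaries from locations to tuples, `None` stands for the special location used in dummy cells, and the mapping on locations is a dictionary listing only the four values given in the example.

```python
BOT = None  # stands for the special location filling dummy cells
KAPPA = 1   # out-degree of the truncated heap in this example


def trunc(h_prime, kappa=KAPPA):
    """Drop the dummy cells and keep the first kappa fields of the others."""
    return {l: cell[:kappa] for l, cell in h_prime.items()
            if any(field is not BOT for field in cell)}


def image(gamma, heap):
    """Image of a heap under a mapping given as a dict (assumed injective
    on the domain of the heap)."""
    return {gamma[l]: tuple(gamma[f] for f in cell) for l, cell in heap.items()}


h = {1: (2,), 2: (2,)}
h1_prime = {1: (2, 0, 1), 2: (2, 0, 3), 3: (BOT, BOT, BOT)}
h2_prime = {1: (3, 0, 1), 2: (4, 0, 3), 3: (BOT, BOT, BOT)}
gamma = {1: 1, 2: 2, 3: 2, 4: 2}
```

Applying the mapping to the truncation of the second expanded heap gives back the original heap, which is the relationship between expansions and truncations that the example is meant to illustrate.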


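Lemma 15. If $(\mathfrak{s},\mathfrak{h}') \triangleright_\gamma (\mathfrak{s},\mathfrak{h})$ then $\mathfrak{h} = \gamma(\mathrm{trunc}(\mathfrak{h}'))$, hence $(\mathfrak{s},\mathfrak{h}') \triangleright_{\mathrm{id}} (\mathfrak{s},\mathrm{trunc}(\mathfrak{h}'))$.


The converse of Lemma 15 does not hold in general, but it holds under some additional
conditions:


Lemma 16. Consider a store $\mathfrak{s}$, let $\mathfrak{h}'$ be a heap and let $\mathfrak{h} = \mathrm{trunc}(\mathfrak{h}')$. Let $D_2 \stackrel{\text{def}}{=} \{\ell \in
\mathrm{dom}(\mathfrak{h}') \mid \mathfrak{h}'(\ell) = \boldsymbol{\bot\!\!\!\bot}\}$ and $D_1 \stackrel{\text{def}}{=} \mathrm{dom}(\mathfrak{h}') \setminus D_2$. Assume that:
1. for every location $\ell \in D_1$, $\mathfrak{h}(\ell)$ is of the form $(\ell_1,\ldots,\ell_\kappa)$ and $\mathfrak{h}'(\ell)$ is of the form
$(\ell_1,\ldots,\ell_\kappa, \mathfrak{s}(\mathbf{w}), \ell'_1,\ldots,\ell'_\mu)$,
2. every location $\ell \in D_2$ has a connection in $\mathfrak{h}'$.

Then $(\mathfrak{s},\mathfrak{h}') \triangleright_{\mathrm{id}} (\mathfrak{s},\mathfrak{h})$.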


4.2 Transforming the Consequent 


We first describe the transformation for the right-hand side of the entailment problem, 
as this transformation is simpler. 


Definition 17. We associate each $n$-ary predicate $p \in \mathcal{P}_r$ with a new predicate $\widehat{p}$ of arity
$n + \nu$. We denote by $\widehat{\alpha}$ the formula obtained from $\alpha$ by replacing every predicate atom
$p(x_1,\ldots,x_n)$ by $\widehat{p}(x_1,\ldots,x_n,\mathbf{w})$, where $\mathbf{w} = (w_1,\ldots,w_\nu)$.


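Definition 18. We denote by $\widehat{\mathcal{R}}$ the set of rules of the form:
$$\widehat{p}(x_1,\ldots,x_n,\mathbf{w}) \Leftarrow x_1 \mapsto (y_1,\ldots,y_\kappa,\mathbf{w},z_1,\ldots,z_\mu)\sigma * \widehat{\rho}\sigma * \xi_I * \chi_\sigma$$
where:
- $p(x_1,\ldots,x_n) \Leftarrow x_1 \mapsto (y_1,\ldots,y_\kappa) * \rho$ is a rule in $\mathcal{R}$ with $p \in \mathcal{P}_r$,
- $z_1,\ldots,z_\mu$ are variables not occurring in $\mathrm{fv}(\rho) \cup \{x_1,\ldots,x_n,y_1,\ldots,y_\kappa,w_1,\ldots,w_\nu\}$,
- $\sigma$ is a substitution with $\mathrm{dom}(\sigma) \subseteq \mathrm{fv}(\rho) \setminus \{x_1\}$ and $\mathrm{rng}(\sigma) \subseteq \{w_1,\ldots,w_\nu\}$,
- $\xi_I \stackrel{\text{def}}{=} \mathop{*}_{i \in I} \underline{\bot}(z_i)$, with $I \subseteq \{1,\ldots,\mu\}$,
- $\chi_\sigma \stackrel{\text{def}}{=} \mathop{*}_{x \in \mathrm{dom}(\sigma)} x \approx x\sigma$.

We denote by $\mathcal{R}_r$ the set of rules in $\widehat{\mathcal{R}}$ that are connected⁵.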


Note that the free variables $\mathbf{w}$ are added as parameters in the rules above, instead of
some arbitrary tuple of fresh variables $\boldsymbol{\omega}$, of the same length as $\mathbf{w}$. This is for the sake
of conciseness, since these parameters $\boldsymbol{\omega}$ will be systematically mapped to $\mathbf{w}$.


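Example 19. Assume that $\psi = \exists x \,.\, p(x,w_1)$, with $\nu = 1$, $\mu = 1$ and $\lambda(p) = \{2\}$. Assume
also that $p$ is associated with the rule: $p(u_1,u_2) \Leftarrow u_1 \mapsto u_1 * q(u_2)$. Observe that the rule
is $\lambda$-connected, but not connected. Then $\mathrm{dom}(\sigma) \subseteq \{u_2\}$, $\mathrm{rng}(\sigma) \subseteq \{w_1\}$ and $I \subseteq \{1\}$,
so that $\widehat{\mathcal{R}}$ contains the following rules:

(1) $\widehat{p}(u_1,u_2,w_1) \Leftarrow u_1 \mapsto (u_1,w_1,z_1) * \widehat{q}(u_2)$
(2) $\widehat{p}(u_1,u_2,w_1) \Leftarrow u_1 \mapsto (u_1,w_1,z_1) * \widehat{q}(u_2) * \underline{\bot}(z_1)$
(3) $\widehat{p}(u_1,u_2,w_1) \Leftarrow u_1 \mapsto (u_1,w_1,z_1) * \widehat{q}(w_1) * u_2 \approx w_1$
(4) $\widehat{p}(u_1,u_2,w_1) \Leftarrow u_1 \mapsto (u_1,w_1,z_1) * \widehat{q}(w_1) * \underline{\bot}(z_1) * u_2 \approx w_1$

Rules (1) and (2) are not connected, hence do not occur in $\mathcal{R}_r$. Rules (3) and (4) are
connected, hence occur in $\mathcal{R}_r$. Note that (4) is established, but (3) is not.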
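The four rules of this example arise from the two possible substitutions and the two possible index sets, and the connectivity filter can be replayed as follows. This is a sketch under simplifying assumptions: only the first argument of each predicate atom is tracked (which is all the connectivity condition inspects), the dummy predicate is named `"bot"`, and the function name is hypothetical.

```python
from itertools import product


def example_19_rules():
    """Enumerate the four candidate rules and flag the connected ones."""
    results = []
    # sigma ranges over substitutions with domain within {u2} and range
    # within {w1}; the Boolean says whether z1 is allocated (I = {1}).
    for sigma, allocate_z1 in product([{}, {"u2": "w1"}], [False, True]):
        fields = tuple(sigma.get(v, v) for v in ("u1", "w1", "z1"))
        pred_atoms = [("q", (sigma.get("u2", "u2"),))]
        if allocate_z1:
            pred_atoms.append(("bot", ("z1",)))
        connected = all(args[0] in fields for _, args in pred_atoms)
        results.append({
            "sigma": sigma,
            "I": {1} if allocate_z1 else set(),
            "equalities": sorted(sigma.items()),
            "connected": connected,
        })
    return results
```

The enumeration order matches the numbering (1) to (4) of the example: the first two candidates keep the atom on u2 and are rejected, the last two redirect it to w1 and are kept.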


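We now relate the SIDs $\mathcal{R}$ and $\mathcal{R}_r$ by the following result:


Lemma 20. Let $\alpha$ be a formula that is $\lambda$-restricted w.r.t. $\{w_1,\ldots,w_\nu\}$ and contains no
points-to atoms, with $\mathcal{P}(\alpha) \subseteq \mathcal{P}_r$. Given a store $\mathfrak{s}$ and two heaps $\mathfrak{h}$ and $\mathfrak{h}'$, such that
$(\mathfrak{s},\mathfrak{h}') \triangleright_{\mathrm{id}} (\mathfrak{s},\mathfrak{h})$, we have $(\mathfrak{s},\mathfrak{h}') \models_{\mathcal{R}_r} \widehat{\alpha}$ if and only if $(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \alpha$.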


4.3 Transforming the Antecedent 


We now describe the transformation operating on the left-hand side of the entailment 
problem. For technical convenience, we make the following assumption: 


Assumption 21. We assume that, for every predicate $p \in \mathcal{P}_l$, every rule of the form
$p(x_1,\ldots,x_n) \Leftarrow \pi$ in $\mathcal{R}$ and every atom $q(x'_1,\ldots,x'_m)$ occurring in $\pi$, $x'_1 \notin \{x_1,\ldots,x_n\}$.


This is without loss of generality, because every variable $x'_1 \in \{x_1,\ldots,x_n\}$ can be replaced
by a fresh variable $z$, while conjoining the equational atom $z \approx x'_1$ to $\pi$. Note that
the obtained SID may no longer be connected, but this is not problematic, because the
left-hand side of the entailment is not required to be connected anyway.


Definition 22. We associate each pair $(p,X)$, where $p \in \mathcal{P}_l$, $\mathrm{ar}(p) = n$ and $X \subseteq [1,n]$,
with a fresh predicate symbol $p_X$, such that $\mathrm{ar}(p_X) = n + \nu$. A decoration of a formula $\alpha$
containing no points-to atoms, such that $\mathcal{P}(\alpha) \subseteq \mathcal{P}_l$, is a formula obtained by replacing
each predicate atom $\beta = q(y_1,\ldots,y_m)$ in $\alpha$ by an atom of the form $q_{X_\beta}(y_1,\ldots,y_m,\mathbf{w})$,
with $X_\beta \subseteq [1,m]$. The set of decorations of a formula $\alpha$ is denoted by $D(\alpha)$.


⁵ Note that all the rules in $\widehat{\mathcal{R}}$ are progressing.

The role of the set $X$ in a predicate atom $p_X(x_1,\ldots,x_n,\mathbf{w})$ will be explained below. Note
that the set of decorations of an atom $\alpha$ is always finite.


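Definition 23. We denote by $D(\mathcal{R})$ the set of rules of the form
$$p_X(x_1,\ldots,x_n,\mathbf{w}) \Leftarrow x_1 \mapsto (y_1,\ldots,y_\kappa,\mathbf{w},z_1,\ldots,z_\mu)\sigma * \rho' * \mathop{*}_{i \in I} \underline{\bot}(z_i),$$
where:
- $p(x_1,\ldots,x_n) \Leftarrow x_1 \mapsto (y_1,\ldots,y_\kappa) * \rho$ is a rule in $\mathcal{R}$ and $X \subseteq [1,n]$;
- $\{z_1,\ldots,z_\mu\} = (\mathrm{fv}(\rho) \cup \{y_1,\ldots,y_\kappa\}) \setminus \{x_1,\ldots,x_n\}$;
- $\sigma$ is a substitution, with $\mathrm{dom}(\sigma) \subseteq \{z_1,\ldots,z_\mu\}$ and $\mathrm{rng}(\sigma) \subseteq \{x_1,\ldots,x_n,w_1,\ldots,w_\nu,
z_1,\ldots,z_\mu\}$;
- $\rho'$ is a decoration of $\rho\sigma$;
- $I \subseteq \{1,\ldots,\mu\}$ and $z_i \notin \mathrm{dom}(\sigma)$, for all $i \in I$.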


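Lemma 24. Let $\alpha$ be a formula containing no points-to atom, with $\mathcal{P}(\alpha) \subseteq \mathcal{P}_l$, and let
$\alpha'$ be a decoration of $\alpha$. If $(\mathfrak{s},\mathfrak{h}') \models_{D(\mathcal{R})} \alpha'$ and $(\mathfrak{s},\mathfrak{h}') \triangleright_{\mathrm{id}} (\mathfrak{s},\mathfrak{h})$, then $(\mathfrak{s},\mathfrak{h}) \models_{\mathcal{R}} \alpha$.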


At this point, the set $X$ for predicate symbol $p_X$ is of little interest: atoms are simply
decorated with arbitrary sets. However, we shall restrict the considered rules in such
a way that for every model $(\mathfrak{s},\mathfrak{h})$ of an atom $p_X(x_1,\ldots,x_{n+\nu})$, with $n = \mathrm{ar}(p)$, the set
$X$ denotes a set of indices $i \in [1,n]$ such that $\mathfrak{s}(x_i) \in \mathrm{dom}(\mathfrak{h})$. In other words, $X$ will
denote a set of formal parameters of $p_X$ that are allocated in every model of $p_X$.


Definition 25. Given a formula $\alpha$, we define the set $\mathrm{Alloc}(\alpha)$ as follows: $x \in \mathrm{Alloc}(\alpha)$
iff $\alpha$ contains either a points-to atom of the form $x \mapsto (y_1,\ldots,y_{\kappa+\mu+\nu})$, or a predicate
atom $q_X(x'_1,\ldots,x'_{m+\nu})$ with $x'_i = x$ for some $i \in X$.


Note that, in contrast with Definition 1, we do not consider that $x \in \mathrm{Alloc}(\alpha)$, for those
variables $x$ related to a variable from $\mathrm{Alloc}(\alpha)$ by equalities.


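Definition 26. A rule $p_X(x_1,\ldots,x_{n+\nu}) \Leftarrow \pi$ in $D(\mathcal{R})$ with $n = \mathrm{ar}(p)$ and $\pi = x_1 \mapsto
(y_1,\ldots,y_\kappa,\mathbf{w},z_1,\ldots,z_\mu) * \rho'$ is well-defined if the following conditions hold:
1. $\{x_1\} \subseteq \mathrm{Alloc}(p_X(x_1,\ldots,x_{n+\nu})) \subseteq \mathrm{Alloc}(\pi)$;
2. $\mathrm{fv}(\pi) \subseteq \mathrm{Alloc}(\pi) \cup \{x_1,\ldots,x_{n+\nu}\}$.
We denote by $\mathcal{R}_l$ the set of well-defined rules in $D(\mathcal{R})$.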


We first state an important property of $\mathcal{R}_l$.
Lemma 27. Every rule in $\mathcal{R}_l$ is progressing, connected and established.
We now relate the systems $\mathcal{R}$ and $\mathcal{R}_l$ by the following result:


Definition 28. A store $\mathfrak{s}$ is quasi-injective if for all $x, y \in \mathrm{dom}(\mathfrak{s})$, the implication
$\mathfrak{s}(x) = \mathfrak{s}(y) \Rightarrow x = y$ holds whenever $\{x,y\} \not\subseteq \{w_1,\ldots,w_\nu\}$.
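Quasi-injectivity only allows the designated free variables to alias one another; any other pair of distinct variables must be mapped to distinct locations. A direct check, assuming (for illustration) that a store is a dictionary from variable names to locations and that the designated variables are passed explicitly, could look like this; the function name is hypothetical.

```python
def is_quasi_injective(store, free_vars):
    """True iff distinct variables with equal values are both in free_vars.

    store: dict from variable names to locations (assumed encoding).
    free_vars: the designated variables that are allowed to alias.
    """
    allowed = set(free_vars)
    items = list(store.items())
    for i, (x, lx) in enumerate(items):
        for y, ly in items[i + 1:]:
            if lx == ly and not {x, y} <= allowed:
                return False
    return True
```

The quadratic pairwise comparison keeps the sketch close to the definition; grouping variables by location would give a linear-time variant.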


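Lemma 29. Let $L$ be an infinite subset of $\mathcal{L}$. Consider a formula $\alpha$ containing no
points-to atom, with $\mathcal{P}(\alpha) \subseteq \mathcal{P}_l$, and let $(\mathfrak{s},\mathfrak{h})$ be an $\mathcal{R}$-model of $\alpha$, where $\mathfrak{s}$ is quasi-injective,
and $(\mathrm{rng}(\mathfrak{s}) \cup \mathrm{loc}(\mathfrak{h})) \cap L = \emptyset$. There exists a decoration $\alpha'$ of $\alpha$, a heap $\mathfrak{h}'$ and
a mapping $\gamma : \mathcal{L} \to \mathcal{L}$ such that:
- $(\mathfrak{s},\mathfrak{h}') \triangleright_\gamma (\mathfrak{s},\mathfrak{h})$,
- if $\ell \notin L$ then $\gamma(\ell) = \ell$,
- $\mathrm{loc}(\mathfrak{h}') \setminus \mathrm{rng}(\mathfrak{s}) \subseteq L$,
- $\mathrm{dom}(\mathrm{aux}(\mathfrak{h}')) \subseteq L$ and
- $(\mathfrak{s},\mathfrak{h}') \models_{\mathcal{R}_l} \alpha'$.
Furthermore, if $\mathfrak{s}(u) \in \mathrm{dom}(\mathfrak{h}') \setminus \{\mathfrak{s}(w_i) \mid 1 \leq i \leq \nu\}$ then $u \in \mathrm{Alloc}(\alpha')$.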


4.4 Transforming Entailments 


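We define $\widehat{\mathfrak{R}} \triangleq \widehat{\mathfrak{R}}_l \cup \widehat{\mathfrak{R}}_r$. We show that the instance $\varphi \vdash_{\mathfrak{R}} \psi$ of the safe entailment prob-
lem can be solved by considering an entailment problem on $\widehat{\mathfrak{R}}$ involving the elements
of $D(\varphi)$ (see Definition 22). Note that the rules from $\widehat{\mathfrak{R}}_l$ are progressing, connected and
established, by Lemma 27, whereas the rules from $\widehat{\mathfrak{R}}_r$ are progressing and connected,
by Definition 18. Hence, each entailment problem $\varphi' \vdash_{\widehat{\mathfrak{R}}} \widehat{\psi}$, where $\varphi' \in D(\varphi)$, is progressing,
connected and left-established.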


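Lemma 30. $\varphi \models_{\mathfrak{R}} \psi$ if and only if $\bigvee_{\varphi' \in D(\varphi)} \varphi' \models_{\widehat{\mathfrak{R}}} \widehat{\psi}$.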


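Proof. "$\Rightarrow$" Assume that $\varphi \models_{\mathfrak{R}} \psi$ and let $\varphi' \in D(\varphi)$ be a formula, $(\mathfrak{s},\mathfrak{h}')$ be an $\widehat{\mathfrak{R}}$-model
of $\varphi'$ and $\mathfrak{h} = \mathrm{trunc}(\mathfrak{h}')$. By construction, $(\mathfrak{s},\mathfrak{h}')$ is an $\widehat{\mathfrak{R}}_l$-model of $\varphi'$. By definition of
$D(\varphi)$, $\varphi'$ is a decoration of $\varphi$. Let $D_2 \triangleq \{\ell \in \mathrm{dom}(\mathfrak{h}') \mid \mathfrak{h}'(\ell) = \vec{\bot}\}$, $D_1 \triangleq \mathrm{dom}(\mathfrak{h}') \setminus D_2$,
and consider a location $\ell \in \mathrm{dom}(\mathfrak{h}')$. By definition, $\ell$ must be allocated by some rule
in $\widehat{\mathfrak{R}}_l$. If $\ell$ is allocated by a rule of the form given in Definition 23, then necessarily
$\mathfrak{h}'(\ell)$ is of the form $(\ell_1,\dots,\ell_\kappa,\mathfrak{s}(\mathbf{w}),\ell'_1,\dots,\ell'_\mu)$ and $\ell \in D_1$. Otherwise, $\ell$ is allocated
by the predicate $\bot$ and we must have $\ell \in D_2$ by definition of the only rule for $\bot$.
Since this predicate must occur within a rule of the form given in Definition 23, $\ell$
necessarily occurs in the $\mu$ last components of the image of a location in $D_1$, hence
admits a connection in $\mathfrak{h}'$. Consequently, by Lemma 16, $(\mathfrak{s},\mathfrak{h}') \triangleright_{\mathrm{id}} (\mathfrak{s},\mathfrak{h})$, and by Lemma
24, $(\mathfrak{s},\mathfrak{h}) \models_{\mathfrak{R}} \varphi$. Thus $(\mathfrak{s},\mathfrak{h}) \models_{\mathfrak{R}} \psi$, and by Lemma 20, $(\mathfrak{s},\mathfrak{h}') \models_{\widehat{\mathfrak{R}}_r} \widehat{\psi}$, thus $(\mathfrak{s},\mathfrak{h}') \models_{\widehat{\mathfrak{R}}} \widehat{\psi}$.
"$\Leftarrow$" Assume that $\bigvee_{\varphi' \in D(\varphi)} \varphi' \models_{\widehat{\mathfrak{R}}} \widehat{\psi}$ and let $(\mathfrak{s},\mathfrak{h})$ be an $\mathfrak{R}$-model of $\varphi$. Since the
truth values of $\varphi$ and $\psi$ depend only on the variables in $\mathit{fv}(\varphi) \cup \mathit{fv}(\psi)$, we may assume,
w.l.o.g., that $\mathfrak{s}$ is quasi-injective. Consider an infinite set $L \subseteq \mathcal{L}$ such that $(\mathrm{rng}(\mathfrak{s}) \cup
\mathrm{loc}(\mathfrak{h})) \cap L = \emptyset$. By Lemma 29, there exist a heap $\mathfrak{h}'$, a mapping $\gamma : \mathcal{L} \to \mathcal{L}$ and a
decoration $\varphi'$ of $\varphi$ such that $\gamma(\ell) = \ell$ for all $\ell \notin L$, $(\mathfrak{s},\mathfrak{h}') \triangleright_\gamma (\mathfrak{s},\mathfrak{h})$ and $(\mathfrak{s},\mathfrak{h}') \models_{\widehat{\mathfrak{R}}_l} \varphi'$. Since
$\mathrm{rng}(\mathfrak{s}) \cap L = \emptyset$, we also have $\gamma(\mathfrak{s}) = \mathfrak{s}$. Then $(\mathfrak{s},\mathfrak{h}') \models_{\widehat{\mathfrak{R}}} \widehat{\psi}$. Let $\mathfrak{h}_1 = \mathrm{trunc}(\mathfrak{h}')$. Since
$(\mathfrak{s},\mathfrak{h}') \triangleright_\gamma (\mathfrak{s},\mathfrak{h})$, by Lemma 15 we have $(\mathfrak{s},\mathfrak{h}') \triangleright_{\mathrm{id}} (\mathfrak{s},\mathfrak{h}_1)$, and by Lemma 20, $(\mathfrak{s},\mathfrak{h}_1) \models_{\mathfrak{R}} \psi$.
By Lemma 15 we have $\mathfrak{h} = \gamma(\mathfrak{h}_1)$. Since $\psi$ is $\lambda$-restricted w.r.t. $\{w_1,\dots,w_\nu\}$, we deduce
by Lemma 10 that $(\mathfrak{s},\mathfrak{h}) \models_{\mathfrak{R}} \psi$.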


This leads to the main result of this paper: 
Theorem 31. The safe entailment problem is 2EXPTIME-complete. 


Proof. The 2EXPTIME-hard lower bound follows from [8, Theorem 32], as the class
of progressing, $\lambda$-connected and $\lambda$-restricted entailment problems is a subset of the safe
entailment class. For the 2EXPTIME membership, Lemma 30 describes a many-one
reduction to the progressing, connected and established class, shown to be in 2EXPTIME
by Theorem 4. Considering an instance $\mathfrak{P} = \varphi \vdash_{\mathfrak{R}} \psi$ of the safe class, Lemma
30 reduces this to checking the validity of $|D(\varphi)|$ instances of the form $\varphi' \vdash_{\widehat{\mathfrak{R}}} \widehat{\psi}$, which are
all progressing, connected and established, by Lemma 27. Since a formula $\varphi' \in D(\varphi)$
is obtained by replacing each predicate atom $p(x_1,\dots,x_n)$ of $\varphi$ by $p_X(x_1,\dots,x_n,\mathbf{w})$
and there are at most $2^n$ such predicate atoms, it follows that $|D(\varphi)| = 2^{O(w(\mathfrak{P}))}$. To
obtain 2EXPTIME-membership of the problem, it is sufficient to show that each of
the progressing, connected and established instances $\varphi' \vdash_{\widehat{\mathfrak{R}}} \widehat{\psi}$ can be built in time
$|\mathfrak{P}| \cdot 2^{O(w(\mathfrak{P}) \log w(\mathfrak{P}))}$. First, for each $\varphi' \in D(\varphi)$, by Definition 22, we have $|\varphi'| \le |\varphi| \cdot (1+\nu)
\le |\varphi| \cdot (1 + w(\mathfrak{P})) = |\varphi| \cdot 2^{O(\log w(\mathfrak{P}))}$. By Definition 17, we have $|\widehat{\psi}| \le |\psi| \cdot (1+\nu) =
|\psi| \cdot 2^{O(\log w(\mathfrak{P}))}$. By Definition 23, $D(\mathfrak{R})$ can be obtained by enumeration in time that
depends linearly on


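$$|D(\mathfrak{R})| \le |\mathfrak{R}| \cdot 2^{\mu} \cdot (n+\nu+\mu)^{\mu} \le |\mathfrak{R}| \cdot 2^{w(\mathfrak{P}) + w(\mathfrak{P}) \log w(\mathfrak{P})} = |\mathfrak{P}| \cdot 2^{O(w(\mathfrak{P}) \log w(\mathfrak{P}))}.$$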


This is because the number of intervals $I$ is bounded by $2^{\mu}$ and the number of substitu-
tions $\sigma$ by $(n+\nu+\mu)^{\mu}$, in Definition 23. By Definition 25, checking whether a rule is
well-defined can be done in polynomial time in the size of the rule, hence in $2^{O(\log w(\mathfrak{P}))}$,
so the construction of $\widehat{\mathfrak{R}}_l$ takes time $|\mathfrak{P}| \cdot 2^{O(w(\mathfrak{P}) \log w(\mathfrak{P}))}$. Similarly, by Definition 23,


the set R is constructed in time 
IR] < IR] 2 -WB < |R] 27 (BB) -2KPH IEW) = p). 200) 


Moreover, checking that a rule in $\widehat{\mathfrak{R}}_r$ is connected can be done in time polynomial in
the size of the rule, hence the construction of $\widehat{\mathfrak{R}}_r$ takes time $2^{O(w(\mathfrak{P}) \log w(\mathfrak{P}))}$. Then the
entire reduction takes time $2^{O(w(\mathfrak{P}) \log w(\mathfrak{P}))}$, which proves the 2EXPTIME upper bound
for the safe class of entailments.


5 Conclusion and Future Work 


Together with the results of [10,14,6,8], Theorem 31 draws a clear and complete picture 
concerning the decidability and complexity of the entailment problem in Separation 
Logic with inductive definitions. The room for improvement in this direction is probably 
very limited, since Theorem 31 pushes the frontier quite far. Moreover, virtually any 
further relaxation of the conditions leads to undecidability. 

A possible line of future research which could be relevant for applications would be 
to consider inductive rules constructing simultaneously several data structures, which 
could be useful for instance to handle predicates comparing two structures, but it is 
clear that very strong conditions would be required to ensure decidability. We are also 
interested in defining effective, goal-directed, proof procedures (i.e., sequent or tableaux 
calculi) for testing the validity of entailment problems. Thanks to the reduction devised 
in the present paper, it is sufficient to focus on systems that are progressing, connected 
and left-established. We are also trying to extend the results to entailments with formulae
involving data with infinite domains, either by considering a theory of locations (e.g., 
arithmetic on addresses), or, more realistically, by considering additional sorts for data. 
Abstract. Subformula linking is an interactive theorem proving technique
that was initially proposed for (classical) linear logic. It is based on
truth and context preserving rewrites of a conjecture that are triggered 
by a user indicating links between subformulas, which can be done by 
direct manipulation, without the need of tactics or proof languages. The 
system guarantees that a true conjecture can always be rewritten to 
a known, usually trivial, theorem. In this work, we extend subformula 
linking to intuitionistic first-order logic with simply typed lambda-terms 
as the term language of this logic. We then use a well known embedding 
of intuitionistic type theory into this logic to demonstrate one way to 
extend linking to type theory. 


1 Introduction 


Suppose you want to prove a conjecture such as: 


$$(\forall x.\ \exists y.\ a(f(x), y)) \land (\forall z.\ a(f(f(c)), z) \supset b(z)) \supset \exists u.\ b(f(u))$$


or to find replacements for the ?s that would allow a dependent type such as the 
following to be inhabited: 


$$\Pi u{:}(\Pi x{:}a.\ \Pi y{:}(b\ x).\ c\ x\ y).\ \Pi v{:}(\Pi x{:}a.\ b\ x).\ \Pi w{:}a.\ (c\ ?\ ?).$$


In a mainstream interactive theorem proving system you would attempt it by 
giving instructions to a carefully constructed proof verification engine using a 
formal proof language, often with a read-eval-print loop for immediate feedback. 
Your instructions would guide the verifier through the twists and turns of a formal 
derivation until it is satisfied that all formal obligations have been established. 
Your language of instructions could be tactics-based (such as in Coq), or it could 
be a programming language itself (such as in HOL-Light or Agda); it could also 
have a formal structure or be declarative (such as Isabelle/Isar).¹ Despite these
superficial differences, all such systems can broadly be called linguistic because 
the internal state of the verifier can only be modified by means of the formal 


1 These are just illustrative examples of mainstream proof systems and should not be 
read as assigning them a position of privilege or authority. 
© The Author(s) 2021 


A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 200-216, 2021. 
https://doi.org/10.1007/978-3-030-79876-5_12
proof language (and the whims, or semantics if you prefer, of the interpreter
of the language).

An alternative to such a linguistic system would be a system of direct manip- 
ulation, wherein there is a tangible representation of the state of the verifier that 
one can modify directly using such tools as one’s fingers, pointing devices, or eye 
movements. The verifier’s job is then to make sure that the direct manipulation 
attempts are allowed when they are logically permissible and prevented when 
they are not. A prominent example of such a direct manipulation system is the 
proof by pointing technique [3], where mouse clicks on the representation of a 
proof state (in a version of Coq) are given a meaning: a click on a connective deep 
in a formula is interpreted as a sequence of Coq tactics that bring the connective 
to the top, at which point it could be made to interact with the other hypotheses 
or the conclusion in the usual manner. 

A generalization of this idea, called proof by linking, was proposed in [4]. It 
allows the user not only to point but also to link different subformulas, say with 
a multi-touch input device or with a drag-and-drop metaphor. There are two 
immediate benefits of linking over pointing: (1) the surrounding context of a 
formula is not destroyed because the linked subformulas are not brought to the 
top, and (2) the interaction mode is easier to describe to complete novices. For 
instance, a novice could be instructed to “match the atoms” for the first example 
above, in which case they might start by attempting the following link: 


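$$(\forall x.\ \exists y.\ a(f(x), y)) \land (\forall z.\ a(f(f(c)), z) \supset b(z)) \supset \exists u.\ b(f(u))$$
(the link is drawn between the two occurrences of the predicate $a$)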


The linking procedure would interpret this link as a desire to “bring” the 
source atom “to” the destination atom. Without touching any other part of the 
conjecture except the smallest subformula containing both the source and the 
destination of the link, the conjecture would be rewritten to a different one: 


$$(\forall x.\ \exists y.\ \forall z.\ ((a(f(x), y) \supset a(f(f(c)), z)) \supset b(z))) \supset \exists u.\ b(f(u)).$$


The surrounding context of the link is preserved as nothing is brought to the 
top; instead, the source moves through the formula tree to meet the destination. 
The rewrites that underlie the transformation are provability preserving: if the 
rewritten conjecture is provable, then so is the original conjecture. Eventually, 
the conjecture (if true) would be reduced to a trivial theorem such as $\top$. Note
that the novice user does not need to know any proof language to draw these 
links, not even a conceptual proof system such as the sequent calculus. 

The original proof by linking technique was proposed for classical linear logic 
and freely exploited the calculus of structures [17]. In this paper we show how 
to adapt the technique to intuitionistic logics and intuitionistic type theories, 
where the calculus of structures is not so well behaved [18,8] (or, in the case of 
dependent type theory, entirely missing), and where preserving the context of 
the rewrites is a more delicate task. We do this by first defining the technique for 
intuitionistic first-order logic over $\lambda$-terms, and then we use an existing complete
(shallow) embedding of dependent type theory in this logic [6,15]. A secondary
contribution is to give some insight into what a deep inference formalism might 
look like for dependent type theory. 


2 Subformula Linking for Intuitionistic First-Order Logic 


This section will serve both as an introduction to the subformula linking procedure, 
and as evidence that the technique can be applied to intuitionistic logics. Let us 
do this in two phases: first for the propositional fragment, and then extended
with first-order quantification. 


2.1 The Propositional Fragment 


We will use the following grammar of formulas (written A, B,...), where atomic 
formulas are written in lowercase (a, b,...). 


$$A, B, \ldots ::= a \mid A \land B \mid \top \mid A \lor B \mid \bot \mid A \supset B$$


Following usual conventions, the connectives $\land$ and $\lor$ are left-associative, while
$\supset$ is right-associative; the binding priority from strongest to weakest is $\land$, $\lor$, $\supset$.
The true formulas of this calculus can be defined in terms of derivability in a
variety of formal systems such as the sequent calculi LJ or G3ip [11]. In
this paper the precise sequent calculus is not of primary concern; however, we
will use the notation $\Gamma \vdash C$, where $\Gamma$ is a multiset of formulas, to denote that
the formula $C$ is derivable from the assumptions $\Gamma$ using any such calculus.

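A positively signed formula context (written $\mathcal{C}\{\}$) is a formula with a single
occurrence of a hole $\{\}$ in the place where a positively signed subformula may
occur; it is defined mutually recursively with a negatively signed formula context
(written $\mathcal{A}\{\}$) by the following grammar, where $\star \in \{\land, \lor\}$.


$$\mathcal{C}\{\} ::= \{\} \mid A \star \mathcal{C}\{\} \mid \mathcal{C}\{\} \star B \mid A \supset \mathcal{C}\{\} \mid \mathcal{A}\{\} \supset B$$
$$\mathcal{A}\{\} ::= A \star \mathcal{A}\{\} \mid \mathcal{A}\{\} \star B \mid A \supset \mathcal{A}\{\} \mid \mathcal{C}\{\} \supset B$$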


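The replacement of the hole in $\mathcal{C}\{\}$ (resp. $\mathcal{A}\{\}$) with a formula $A$ yields a
new formula, which we write as $\mathcal{C}\{A\}$ (resp. $\mathcal{A}\{A\}$). For instance, if $\mathcal{C}\{\}$ is
$a \land ((b \supset \{\}) \lor d)$, then $\mathcal{C}\{c \supset \bot\}$ is $a \land ((b \supset (c \supset \bot)) \lor d)$.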


Theorem 1. Suppose that $A \vdash B$. Then:


- for any positively signed context $\mathcal{C}\{\}$, it is the case that $\mathcal{C}\{A\} \vdash \mathcal{C}\{B\}$; and
- for any negatively signed context $\mathcal{A}\{\}$, it is the case that $\mathcal{A}\{B\} \vdash \mathcal{A}\{A\}$.


Proof. Induction on the structure of the contexts $\mathcal{C}\{\}$ or $\mathcal{A}\{\}$. □
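The notions of signed context and hole replacement can be made concrete with a small sketch. It assumes formulas are encoded as nested Python tuples and that a position is a path of child indices (0 for the left operand, 1 for the right); both the encoding and the names `sign_at` and `plug` are illustrative choices, not notation from the paper. The sign flips exactly when the path enters the left operand of an implication, which is what the two mutually recursive grammars express.

```python
# Hypothetical encoding: ("atom", name), ("top",), ("bot",),
# ("and", A, B), ("or", A, B), ("imp", A, B).

def sign_at(formula, path):
    """Return (sign, subformula) at the given path; sign is +1 or -1."""
    sign, node = +1, formula
    for step in path:
        if node[0] == "imp" and step == 0:
            sign = -sign          # left of an implication flips the sign
        node = node[1 + step]
    return sign, node


def plug(formula, path, new):
    """Replace the subformula at `path` by `new` (the C{A} operation)."""
    if not path:
        return new
    children = list(formula[1:])
    children[path[0]] = plug(children[path[0]], path[1:], new)
    return (formula[0], *children)


a, b, c, d = (("atom", n) for n in "abcd")
hole = ("atom", "{}")
# The context a /\ ((b > {}) \/ d) from the example above.
ctx = ("and", a, ("or", ("imp", b, hole), d))
print(sign_at(ctx, (1, 0, 1)))            # the hole sits in a positive position
print(plug(ctx, (1, 0, 1), ("imp", c, ("bot",))))
```

Representing a context as a formula plus a path, instead of as a separate data type, keeps the sketch short and lets the same traversal compute the sign and perform the replacement.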


In order to define the subformula linking procedure for this calculus, we work 
with interaction formulas; an interaction formula is a formula where: 
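Terminal rules

$$\frac{\mathcal{C}\{\top\}}{\mathcal{C}\{a \triangleright a\}}\ \mathsf{in} \qquad \frac{\mathcal{C}\{A \supset B\}}{\mathcal{C}\{A \triangleright B\}}\ \mathsf{rel}$$

(the conclusion of rel is understood as not overlapping that of in)

Positively signed rules

$$\frac{\mathcal{C}\{(A \triangleright B) \land F\}}{\mathcal{C}\{A \triangleright (B \land F)\}}\ {\triangleright}{\land}_1 \qquad \frac{\mathcal{C}\{F \land (A \triangleright B)\}}{\mathcal{C}\{A \triangleright (F \land B)\}}\ {\triangleright}{\land}_2$$

$$\frac{\mathcal{C}\{(A \triangleright B) \land (F \supset B)\}}{\mathcal{C}\{(A \lor F) \triangleright B\}}\ {\lor}{\triangleright}_1 \qquad \frac{\mathcal{C}\{(F \supset B) \land (A \triangleright B)\}}{\mathcal{C}\{(F \lor A) \triangleright B\}}\ {\lor}{\triangleright}_2$$

$$\frac{\mathcal{C}\{(A \circ B) \supset F\}}{\mathcal{C}\{A \triangleright (B \supset F)\}}\ {\triangleright}{\supset}_1 \qquad \frac{\mathcal{C}\{F \supset (A \triangleright B)\}}{\mathcal{C}\{A \triangleright (F \supset B)\}}\ {\triangleright}{\supset}_2$$

$$\frac{\mathcal{C}\{A \triangleright B\}}{\mathcal{C}\{A \triangleright (B \lor F)\}}\ {\triangleright}{\lor}_1 \qquad \frac{\mathcal{C}\{A \triangleright B\}}{\mathcal{C}\{A \triangleright (F \lor B)\}}\ {\triangleright}{\lor}_2$$

$$\frac{\mathcal{C}\{A \triangleright B\}}{\mathcal{C}\{(A \land F) \triangleright B\}}\ {\land}{\triangleright}_1 \qquad \frac{\mathcal{C}\{A \triangleright B\}}{\mathcal{C}\{(F \land A) \triangleright B\}}\ {\land}{\triangleright}_2 \qquad \frac{\mathcal{C}\{F \land (A \triangleright B)\}}{\mathcal{C}\{(F \supset A) \triangleright B\}}\ {\supset}{\triangleright}$$

Negatively signed rules

$$\frac{\mathcal{A}\{(A \circ B) \lor F\}}{\mathcal{A}\{A \circ (B \lor F)\}}\ {\circ}{\lor}_1 \qquad \frac{\mathcal{A}\{F \lor (A \circ B)\}}{\mathcal{A}\{A \circ (F \lor B)\}}\ {\circ}{\lor}_2$$

$$\frac{\mathcal{A}\{A \circ B\}}{\mathcal{A}\{A \circ (B \land F)\}}\ {\circ}{\land}_1 \qquad \frac{\mathcal{A}\{A \circ B\}}{\mathcal{A}\{A \circ (F \land B)\}}\ {\circ}{\land}_2$$

$$\frac{\mathcal{A}\{(A \triangleright B) \supset F\}}{\mathcal{A}\{A \circ (B \supset F)\}}\ {\circ}{\supset}_1 \qquad \frac{\mathcal{A}\{F \supset (A \circ B)\}}{\mathcal{A}\{A \circ (F \supset B)\}}\ {\circ}{\supset}_2$$

(plus all the symmetric variants)


Fig. 1. Inference rules for interaction formulas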


- either a single occurrence of $\supset$ is replaced with $\triangleright$,
- or a single occurrence of $\land$ is replaced with $\circ$.


We will define an inference system for interaction formulas that consists of
inference rules with a single conclusion and a single premise, both of which 
are either formulas or interaction formulas. The inference rule represents an 
admissible rule of intuitionistic logic: if the premise is a theorem, then so is the 
conclusion. The full collection of rules is shown in fig. 1. There are three kinds of 
rules, explained below in an upwards (conclusion to premises) reading. 


- Terminal rules are used to terminate a $\triangleright$-interaction in a positively signed
context. In the case where the $\triangleright$-interaction links two occurrences of the
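Interaction creation rules and contraction

$$\frac{\mathcal{C}\{A \triangleright B\}}{\mathcal{C}\{A \supset B\}}\ {\triangleright} \qquad \frac{\mathcal{A}\{A \circ B\}}{\mathcal{A}\{A \land B\}}\ {\circ} \qquad \frac{\mathcal{C}\{A \supset A \supset F\}}{\mathcal{C}\{A \supset F\}}\ \mathsf{cont}$$

Simplification rules

$$\frac{\mathcal{C}\{\top\}}{\mathcal{C}\{A \supset \top\}} \qquad \frac{\mathcal{C}\{B\}}{\mathcal{C}\{\top \supset B\}} \qquad \frac{\mathcal{C}\{\top\}}{\mathcal{C}\{\bot \supset B\}}$$

$$\frac{\mathcal{C}\{F\}}{\mathcal{C}\{\top \land F\}} \qquad \frac{\mathcal{C}\{F\}}{\mathcal{C}\{F \land \top\}} \qquad \frac{\mathcal{A}\{F\}}{\mathcal{A}\{\bot \lor F\}} \qquad \frac{\mathcal{A}\{F\}}{\mathcal{A}\{F \lor \bot\}}$$

$$\frac{\mathcal{A}\{\bot\}}{\mathcal{A}\{\bot \land F\}} \qquad \frac{\mathcal{A}\{\bot\}}{\mathcal{A}\{F \land \bot\}} \qquad \frac{\mathcal{C}\{\top\}}{\mathcal{C}\{\top \lor F\}} \qquad \frac{\mathcal{C}\{\top\}}{\mathcal{C}\{F \lor \top\}}$$


Fig. 2. Link creation, contraction, and simplification. The conclusion in each case must
not be an interaction formula.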


same atom, the result is $\top$; otherwise the $\triangleright$ turns back into $\supset$. These are the
only rules that can transition out of interaction formulas.


- Positively signed rules operate on a $\triangleright$-interaction in a positively signed
context. The rules are written in fig. 1 in such a way that the subformulas
$A$ and $B$ are brought together in the premise, and occurrences of $F$ (if they
exist) are side formulas.


- Negatively signed rules operate on a $\circ$-interaction in a negatively signed
context. Fig. 1 only shows one of the two symmetric variants for each case; the
other variant is built by permuting $A$ with $B$ and transposing the operands
of $\circ$. For instance, ${\circ}{\lor}_1$ has the following symmetric variant.


$$\frac{\mathcal{A}\{(A \circ B) \lor F\}}{\mathcal{A}\{(A \lor F) \circ B\}}\ {\circ}{\lor}_1'$$


We will use primes to systematically name the symmetric variants of rules. 


Proposition 2 (Soundness). Interpreting $\triangleright$ as $\supset$ and $\circ$ as $\land$, each rule of
fig. 1 with premise $P$ and conclusion $Q$ has the property that $P \vdash Q$.


Proof. Straightforward consequence of theorem 1. □


Two further administrative steps remain to complete the technique. First, 
since the rules of fig. 1 always contain an interaction formula in the conclusion, 
we need to add some rules that can conclude ordinary (non-interaction) formulas. 
Since we read each inference rule from conclusion to premise, we will call these the 
interaction creation rules, which are shown in the first part of fig. 2. To incorporate 
non-linearity, we add a separate contraction rule; this keeps the interaction 
creation rules simple, but it needs to be explicitly invoked. These interaction 
creation rules are obviously sound under the interpretation of proposition 2. 
[Derivation tree with conclusion $(a \supset b \supset c) \supset (a \supset b) \supset a \supset c$; the individual rule applications are not recoverable from the extracted text.]


Fig. 3. Lnip derivation fragment for the S-combinator


The final step is to detect when a proof is complete. Since every inference 
rule presented so far has a single premise, we will say that a proof is complete 
when the final (again reading bottom to top) premise is, effectively, $\top$. What
do we mean by “effectively”? One candidate definition could be that a purely 
algorithmic procedure can detect when a proof is finished in linear time. For 
instance, we can say that a proof is complete if its premise can be established 
using only the simplification rules shown in the second part of fig. 2. These rules 
may be applied in any arbitrary order and at any time. An implementation of 
the technique may choose to apply these simplification rules on the fly. 
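To illustrate why a linear-time completeness check is plausible, here is a minimal sketch of a sign-aware simplifier. It assumes the tuple encoding of formulas used for illustration in this rewrite (not the paper's notation) and the simplification rules of fig. 2 as read from the source: unit laws for $\top$ in positively signed positions and for $\bot$ in negatively signed positions. A single bottom-up pass suffices because every rule returns either a constant or an already simplified subformula.

```python
TOP, BOT = ("top",), ("bot",)

def simplify(f, sign=+1):
    """Apply the fig. 2 simplification rules bottom-up, tracking the sign."""
    tag = f[0]
    if tag in ("atom", "top", "bot"):
        return f
    left_sign = -sign if tag == "imp" else sign
    a = simplify(f[1], left_sign)
    b = simplify(f[2], sign)
    if sign > 0:                      # rules stated for positive contexts
        if tag == "imp":
            if b == TOP or a == BOT:
                return TOP
            if a == TOP:
                return b
        elif tag == "and":
            if a == TOP:
                return b
            if b == TOP:
                return a
        elif tag == "or" and (a == TOP or b == TOP):
            return TOP
    else:                             # rules stated for negative contexts
        if tag == "or":
            if a == BOT:
                return b
            if b == BOT:
                return a
        elif tag == "and" and (a == BOT or b == BOT):
            return BOT
    return (tag, a, b)


def is_complete(premise):
    """A proof is complete when its final premise simplifies to top."""
    return simplify(premise) == TOP


a, c = ("atom", "a"), ("atom", "c")
print(is_complete(("imp", a, ("and", TOP, ("and", TOP, TOP)))))
print(simplify(("imp", ("or", BOT, a), c)))
```

An implementation that simplifies on the fly, as the text suggests, could call `simplify` after each resolved link instead of only at the end.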


Definition 3. The collection of rules in figures 1 and 2 will be known as the
proof system Lnip. If $A$ and $B$ are formulas or interaction formulas, we write
$A \vdash_{\mathsf{Lnip}} B$ to mean that either $A = B$ or there is an Lnip derivation where the
topmost rule has premise $A$ and the bottom-most rule has conclusion $B$. □


Theorem 4 (Completeness of Lnip). If $\vdash F$, then $\top \vdash_{\mathsf{Lnip}} F$.


Proof (Sketch). There are many ways to prove this, both syntactic and semantic. 
An instructive syntactic proof goes as follows. For a small variant of the G3ip 
sequent calculus [11], we show that every inference rule is admissible in Lnip 
under a suitable formula interpretation of sequents. Thus, any sequent proof is 
recoverable in terms of Lnip inferences. We then just appeal to completeness of 
the sequent calculus. □


Example 5. An Lnip derivation of the S-combinator formula, $(a \supset b \supset c) \supset (a \supset b) \supset a \supset c$,
is shown in fig. 3. The interaction connectives $\triangleright$ and $\circ$ take the precedence and
associativity of $\supset$ and $\land$ respectively. The locus where an Lnip rule is applied is
depicted with a highlight. Of course, the S-combinator formula cannot be proved
without appealing to contraction at least once, which is seen by the appeal to
cont in the derivation.
An extremely interesting aspect of this example Lnip derivation is that it
begins by considering the first two assumptions, $(a \supset b \supset c)$ and $(a \supset b)$, of
the S-combinator formula. The user might have indicated this consideration by
drawing a link between the two occurrences of $b$, highlighted in orange and blue
in fig. 3. The effect of this consideration is to perform a "composition" of the
two assumptions into the stronger assumption $(a \supset a \supset \top \supset c)$, which could of
course have been simplified to $(a \supset a \supset c)$ immediately. In shallow proof systems
such as the sequent calculus or natural deduction this kind of compositional step
cannot be taken as such, and would require cuts or lemmas.

As explained in the introduction, this kind of composition might have been
discovered in the process of exploration by the simple strategy of drawing a link
between the two occurrences of $b$. Such a link is legal because in the common
context that contains both occurrences of $b$, their ancestral connective is $\supset$, which
can be turned into a $\triangleright$ interaction using the $\triangleright$ rule. Once these two occurrences
are linked, we can interpret the interaction rules (fig. 1) as trying to bring the
two ends of the link closer. Indeed, in each of the rules of fig. 1, we can say that
one of the ends of the link is in the formula $A$ and the other is in the formula $B$.
We are therefore ready to formulate the linking procedure.


Definition 6 (Subformula Linking Procedure). Repeat the following se-
quence of steps until the conjecture formula (i.e., end-formula) $F$ is transformed
to $\top$ (success), no fruitful progress can be made (failure), or the proof attempt is
aborted by the user.

1. (Optional) Ask the user to indicate negatively signed subformulas of $F$ that
need to be contracted using the cont rule.

2. Ask the user to indicate two different subformulas of $F$; this is the link.

3. If the first common ancestor connective of the two linked subformulas is a $\supset$
that occurs in a positively signed context, use the $\triangleright$ rule to turn it into a $\triangleright$;
likewise, if the ancestor is a $\land$ in a negatively signed context, use the $\circ$ rule
to turn it into a $\circ$. If neither case applies, then the user indicated an invalid
link, so we return immediately to step 2.

4. Use the interaction rules (fig. 1) in such a way that the endpoints of the link
stay in the same interaction from conclusion to premise.

5. Eventually, one of the terminal rules in or rel will be applicable to remove
the interaction; at this point we say that the link is resolved.

6. After resolving a link, the simplification rules may be applied eagerly in an
arbitrary order.
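Step 3 of the procedure is a purely syntactic test, which the following sketch spells out. It assumes the illustrative tuple encoding of formulas used elsewhere in this rewrite and paths of child indices for the two endpoints of the link; the function name and the returned labels are hypothetical. The first common ancestor is the node at the longest common prefix of the two paths, and its sign decides which interaction creation rule, if any, applies.

```python
def classify_link(formula, src, dst):
    """Return which interaction a link creates, or None if it is invalid.

    src, dst: paths (tuples of 0/1 child indices) to the linked subformulas.
    """
    k = 0
    while k < len(src) and k < len(dst) and src[k] == dst[k]:
        k += 1
    if k == len(src) or k == len(dst):
        return None                   # one endpoint contains the other
    sign, node = +1, formula
    for step in src[:k]:              # walk down to the common ancestor
        if node[0] == "imp" and step == 0:
            sign = -sign
        node = node[1 + step]
    if node[0] == "imp" and sign > 0:
        return "imp-interaction"      # use the interaction rule for implication
    if node[0] == "and" and sign < 0:
        return "and-interaction"      # use the interaction rule for conjunction
    return None                       # invalid link: go back to step 2


a, b, c = (("atom", n) for n in "abc")
# S-combinator formula (a > b > c) > (a > b) > a > c
s_comb = ("imp", ("imp", a, ("imp", b, c)),
                 ("imp", ("imp", a, b), ("imp", a, c)))
print(classify_link(s_comb, (0, 1, 0), (1, 0, 1)))   # the two occurrences of b
print(classify_link(("and", a, b), (0,), (1,)))      # positive conjunction
```

For the S-combinator formula the common ancestor of the two occurrences of $b$ is the outermost implication, in agreement with the discussion following Example 5.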


The most important step in the inner loop of the procedure is step 4. The
rules for interaction are not unambiguous because the conclusions of different
rules can overlap. Let us start by examining the positively signed rules; as an
example, consider the interaction $\mathcal{C}\{(F \supset A) \triangleright (G \supset B)\}$, with the understanding
that the endpoints of the indicated link in step 2 are present in $A$ and $B$. There


Linking for Intuitionistic Logic (and Type Theory) 207 


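are two possible ways to resolve this link:

$$\dfrac{\dfrac{\mathcal{C}\{F \land (G \supset (A \triangleright B))\}}{\mathcal{C}\{F \land (A \triangleright (G \supset B))\}}\ {\triangleright}{\supset}_2}{\mathcal{C}\{(F \supset A) \triangleright (G \supset B)\}}\ {\supset}{\triangleright} \qquad \dfrac{\dfrac{\mathcal{C}\{G \supset (F \land (A \triangleright B))\}}{\mathcal{C}\{G \supset ((F \supset A) \triangleright B)\}}\ {\supset}{\triangleright}}{\mathcal{C}\{(F \supset A) \triangleright (G \supset B)\}}\ {\triangleright}{\supset}_2$$


Does the choice matter? Yes, because the formulas $F \land (G \supset H)$ and $G \supset (F \land H)$
are not intuitionistically equivalent; indeed, the former strictly entails the latter.
Hence, one of the two alternatives produces a strictly stronger (and potentially
unprovable!) premise. Which one should the procedure pick?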
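The non-equivalence can be confirmed by a brute-force check. The sketch below enumerates classical truth assignments only; this is enough for the negative direction, because an intuitionistically valid entailment is also classically valid, so a classical countermodel rules it out. The forward entailment is intuitionistically immediate, and the table merely confirms it classically. The function names are illustrative.

```python
from itertools import product

def implies(p, q):
    return (not p) or q

def stronger(f, g, h):
    """F and (G implies H)"""
    return f and implies(g, h)

def weaker(f, g, h):
    """G implies (F and H)"""
    return implies(g, f and h)

rows = list(product([False, True], repeat=3))

# The former entails the latter under every assignment.
forward_holds = all(implies(stronger(*r), weaker(*r)) for r in rows)

# Assignments where the latter holds but the former fails.
countermodels = [r for r in rows if weaker(*r) and not stronger(*r)]

print(forward_holds)
print(countermodels)
```

Every countermodel makes both $F$ and $G$ false: the implication holds vacuously while the conjunction fails, which is exactly the information lost by the weaker premise.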

This ambiguity also existed in the original formulation of the formula linking
procedure for classical linear logic [4], and we can use the same answer used in
that work. The key insight is that many of the ambiguous cases can be resolved
by a simple analysis of polarities. A detailed discussion of polarity (and the
oft-associated focusing discipline [1]) is not relevant to this work, however.² We
will instead just use the observation that some of the interaction rules of fig. 1
are asynchronous, meaning that the premise of the rule is equiderivable with the
conclusion (assuming we replace $\triangleright$ and $\circ$ with $\supset$ and $\land$ respectively), while
other rules are synchronous, which means that the premise strictly entails the
conclusion. For the specific example above, the ${\triangleright}{\supset}_2$ rule is asynchronous, because
the order of assumptions in an implication is immaterial (at least in intuitionistic
logic), while the ${\supset}{\triangleright}$ rule is synchronous since its conclusion cannot justify the
premise. We can draw up this table for all the positively signed rules.


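asynchronous rules: ${\triangleright}{\land}_1$, ${\triangleright}{\land}_2$, ${\lor}{\triangleright}_1$, ${\lor}{\triangleright}_2$, ${\triangleright}{\supset}_1$, ${\triangleright}{\supset}_2$
synchronous rules: ${\triangleright}{\lor}_1$, ${\triangleright}{\lor}_2$, ${\land}{\triangleright}_1$, ${\land}{\triangleright}_2$, ${\supset}{\triangleright}$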


Whenever there is a choice between a synchronous and an asynchronous rule 
to apply first (reading from bottom to top), we should pick the asynchronous rule, 
since that does not destroy derivability. If we have a choice of two asynchronous 
rules, then the choice is immaterial, as derivability is preserved regardless; the 
procedure can pick arbitrarily. Different choices would just lead to associative- 
commutative variants of the same ultimate premise. Finally, for a choice between 
two synchronous rules, we can consider all such pairs from the table above to see 
that the choice is immaterial: all choices have the same result. 
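The selection policy for positively signed rules fits in a few lines. This sketch assumes that some other component has already computed which rules are applicable at the current interaction; the ASCII rule identifiers (for example `rhd_imp_2` for the second implication rule on the right of an interaction) are hypothetical names introduced here and merely mirror the table above.

```python
# Hypothetical ASCII names for the positively signed rules of fig. 1.
ASYNCHRONOUS = {"rhd_and_1", "rhd_and_2", "or_rhd_1", "or_rhd_2",
                "rhd_imp_1", "rhd_imp_2"}
SYNCHRONOUS = {"rhd_or_1", "rhd_or_2", "and_rhd_1", "and_rhd_2", "imp_rhd"}

def pick_rule(applicable):
    """Prefer any asynchronous rule; otherwise take any synchronous one."""
    for rule in applicable:
        if rule in ASYNCHRONOUS:
            return rule               # preserves derivability
    for rule in applicable:
        if rule in SYNCHRONOUS:
            return rule               # ties among these are immaterial
    return None


# For C{(F > A) |> (G > B)} both imp_rhd and rhd_imp_2 apply.
print(pick_rule(["imp_rhd", "rhd_imp_2"]))
```

Taking the first match among rules of the same kind reflects the observation that such ties lead to the same premise, up to associativity and commutativity.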

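The story is not quite as simple for the negatively signed rules of fig. 1, where
every single rule would be synchronous by our definition. Unlike in the positively
signed case, here we have a critical pair.

$$\dfrac{\dfrac{\mathcal{A}\{(F \supset (A \circ B)) \lor G\}}{\mathcal{A}\{((F \supset A) \circ B) \lor G\}}\ {\circ}{\supset}_2'}{\mathcal{A}\{(F \supset A) \circ (B \lor G)\}}\ {\circ}{\lor}_1 \qquad \dfrac{\dfrac{\mathcal{A}\{F \supset ((A \circ B) \lor G)\}}{\mathcal{A}\{F \supset (A \circ (B \lor G))\}}\ {\circ}{\lor}_1}{\mathcal{A}\{(F \supset A) \circ (B \lor G)\}}\ {\circ}{\supset}_2'$$

As before, the premises are not equiderivable. Resolving this ambiguity is going to
be as hard as fully automated proof search, which will therefore not be recursively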

² Our choice of connectives here has only negative polarity connectives except $\bot$ and
$\lor$. In intuitionistic logic it is also possible to have a positive $\land$ and atoms of both
polarities [5,10], but this generality is not necessary for the present work.


208 K. Chaudhuri 


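Terminal rules

$$\frac{\mathcal{C}\{\vec{s} = \vec{t}\}}{\mathcal{C}\{a{\cdot}\vec{s} \triangleright a{\cdot}\vec{t}\}}\ \mathsf{in}$$

Quantifier rules

$$\frac{\mathcal{C}\{\forall y.\ (A \triangleright B)\}}{\mathcal{C}\{A \triangleright \forall y.\ B\}}\ {\triangleright}\forall \qquad \frac{\mathcal{C}\{\exists x.\ (A \triangleright B)\}}{\mathcal{C}\{(\forall x.\ A) \triangleright B\}}\ \forall{\triangleright} \qquad \frac{\mathcal{A}\{\forall y.\ (A \circ B)\}}{\mathcal{A}\{A \circ \forall y.\ B\}}\ {\circ}\forall$$

$$\frac{\mathcal{C}\{\exists y.\ (A \triangleright B)\}}{\mathcal{C}\{A \triangleright \exists y.\ B\}}\ {\triangleright}\exists \qquad \frac{\mathcal{C}\{\forall x.\ (A \triangleright B)\}}{\mathcal{C}\{(\exists x.\ A) \triangleright B\}}\ \exists{\triangleright} \qquad \frac{\mathcal{A}\{\exists y.\ (A \circ B)\}}{\mathcal{A}\{A \circ \exists y.\ B\}}\ {\circ}\exists$$

(in each rule, $x \# B$ and $y \# A$)

Simplification and instantiation rules

$$\frac{\mathcal{C}\{\top\}}{\mathcal{C}\{\forall x.\ \top\}} \qquad \frac{\mathcal{C}\{\top\}}{\mathcal{C}\{x = x\}}\ \mathsf{refl} \qquad \frac{\mathcal{C}\{\vec{s} = \vec{t}\}}{\mathcal{C}\{f{\cdot}\vec{s} = f{\cdot}\vec{t}\}}\ \mathsf{cong} \qquad \frac{\mathcal{C}\{t\ \mathsf{term}\} \quad \mathcal{C}\{[t/x]A\}}{\mathcal{C}\{\exists x.\ A\}}\ \mathsf{inst}$$


Fig. 4. System Lni: rules for quantifiers and terms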


solvable as soon as we introduce quantifiers. The subformula linking procedure
needs further guidance from the user to resolve the ambiguity. A variant of this
ambiguity can also be found in the original subformula linking work for classical
linear logic [4]; there, the solution was to make the links directed. Then, whenever
there is a choice to be made (which will necessarily have to be a choice between
one subformula containing the source of the link and the other containing the
destination), the procedure can choose to perform the rule corresponding to the
destination first. In the above critical pair, for instance, if $A$ contained the source
and $B$ the destination, then we would perform the ${\circ}{\lor}_1$ step first (i.e., follow the
left derivation). This choice is made to evoke the intuition that the source is
brought to the destination; the context of the destination swallows the context of
the source.


Definition 7 (Directed Subformula Linking Procedure). We modify the 
procedure of definition 6 by making the links in step 2 directed, and in the 
resolution step 4 we break synchronous/synchronous ties for negatively signed 
rules by performing the rule for the destination first. 


2.2 Quantifiers 


Extending Lnip with first-order quantifiers can be done in a number of ways. 
Here we present a parsimonious extension that avoids any up front commitments 
with regard to the strength of the term language. Our terms (written s,t,...) 
have the following grammar: 
$$s, t, \ldots ::= x \mid f{\cdot}\vec{s}$$

where we write $\vec{s}$ to stand for a list of terms $[s_1, s_2, \ldots, s_n]$. We use $x, y, \ldots$
to range over variables and $f, g, \ldots$ to range over function symbols, and we
abbreviate $f{\cdot}[]$ to $f$. We also extend atomic formulas: they are now written $a{\cdot}\vec{s}$
where $a$ is a predicate symbol, and we again abbreviate $a{\cdot}[]$ to $a$. To formulas
and contexts we now add the two quantifiers, $\forall$ and $\exists$, to give the following
extended grammars, where $\star \in \{\land, \lor\}$ and $Q \in \{\forall, \exists\}$.


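$$A, B, \ldots ::= a{\cdot}\vec{s} \mid A \land B \mid \top \mid A \lor B \mid \bot \mid A \supset B \mid \forall x.\ A \mid \exists x.\ A$$
$$\mathcal{C}\{\} ::= \{\} \mid A \star \mathcal{C}\{\} \mid \mathcal{C}\{\} \star B \mid Qx.\ \mathcal{C}\{\} \mid A \supset \mathcal{C}\{\} \mid \mathcal{A}\{\} \supset B$$
$$\mathcal{A}\{\} ::= A \star \mathcal{A}\{\} \mid \mathcal{A}\{\} \star B \mid Qx.\ \mathcal{A}\{\} \mid A \supset \mathcal{A}\{\} \mid \mathcal{C}\{\} \supset B$$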


We write $\mathcal{C}\{t\ \mathsf{term}\}$ to assert that the term $t$ is well-formed for the hole in
$\mathcal{C}\{\}$, i.e., all the (free) variables of $t$ are bound by some quantifier that the hole
in $\mathcal{C}\{\}$ is in the scope of. We also write $x \# t$ or $x \# A$ to indicate that the variable
$x$ is not free in $t$ or $A$ respectively. Finally, the capture-avoiding substitution of
$t$ for $x$ in a term $u$ or formula $A$ is written $[t/x]u$ or $[t/x]A$ respectively. The
replacement of formulas in contexts, on the other hand, is not capture-avoiding;
instead, the replacement $\mathcal{C}\{A\}$ is considered to be well-formed whenever every
free variable $x$ of $A$ has the property that $\mathcal{C}\{x\ \mathsf{term}\}$.

In order to give ourselves maximum freedom in the definition of the first-order
extension, we will use the additional binary predicate symbol $=$ to denote equality.
Given two lists of terms $\vec{s} = [s_1, \ldots, s_n]$ and $\vec{t} = [t_1, \ldots, t_n]$ of equal length, we
will write $\vec{s} = \vec{t}$ to stand for $(s_1 = t_1) \land \cdots \land (s_n = t_n)$ if $n > 0$ and for $\top$ otherwise.
Using this additional predicate, the terminal rule in of Lnip is modified to account
for the term arguments.


Definition 8 (System Lni). The system Lni is obtained from Lnip by removing
the in rule of Lnip and adding the rules of fig. 4.


Theorem 9 (Completeness of Lni). If $\vdash F$ in a complete sequent calculus
for first-order intuitionistic logic (e.g., G3i [11]) then $\top \vdash_{\mathsf{Lni}} F$.


Proof (Sketch). We can follow the same strategy as for theorem 4. Note that for
any term $t$, the rules refl and cong suffice to reduce $\mathcal{C}\{t = t\}$ to $\mathcal{C}\{\top\}$. A transitivity
rule for $=$ is not needed: no $=$ is created in a negatively signed context. □


Example 10. Two example Lni derivations are shown in fig. 5. 


(a) This is a derivation for a provable formula where the user may have linked 
the two occurrences of a. Observe that the simplification rules {cong, inst, 
refl} help to implement first-order unification under a mixed quantifier prefix. 
However, since Lni simplification rules can be applied at any time, we can 
solve unification problems incrementally, in tandem with logical reasoning. 

(b) This is a derivation for an unprovable formula containing an illegal quantifier 
exchange, where once again the indicated link is between the two occurrences 
of $a$. This derivation cannot be completed because there is no instantiation
for $x$ for which $\forall w.\ x = w$ is true.
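(a) Each line is the premise of the rule named on its right; the conclusion of that rule is the line below.

$$\begin{array}{ll}
\top & \text{simplification} \\
\forall y.\ (\top \land \top) & \mathsf{cong} \times 2,\ \mathsf{refl} \\
\forall y.\ (f{\cdot}[c] = f{\cdot}[c] \land y = y) & \mathsf{inst} \\
\forall y.\ \exists z.\ (f{\cdot}[c] = f{\cdot}[c] \land y = z) & \mathsf{inst} \\
\exists x.\ \forall y.\ \exists z.\ (x = f{\cdot}[c] \land y = z) & \mathsf{cong} \\
\exists x.\ \forall y.\ \exists z.\ (f{\cdot}[x] = f{\cdot}[f{\cdot}[c]] \land y = z) & \mathsf{in} \\
\exists x.\ \forall y.\ \exists z.\ (a{\cdot}[f{\cdot}[x], y] \triangleright a{\cdot}[f{\cdot}[f{\cdot}[c]], z]) & {\triangleright}\exists \\
\exists x.\ \forall y.\ (a{\cdot}[f{\cdot}[x], y] \triangleright (\exists z.\ a{\cdot}[f{\cdot}[f{\cdot}[c]], z])) & \exists{\triangleright} \\
\exists x.\ ((\exists y.\ a{\cdot}[f{\cdot}[x], y]) \triangleright (\exists z.\ a{\cdot}[f{\cdot}[f{\cdot}[c]], z])) & \forall{\triangleright} \\
(\forall x.\ \exists y.\ a{\cdot}[f{\cdot}[x], y]) \triangleright (\exists z.\ a{\cdot}[f{\cdot}[f{\cdot}[c]], z]) & {\triangleright} \\
(\forall x.\ \exists y.\ a{\cdot}[f{\cdot}[x], y]) \supset (\exists z.\ a{\cdot}[f{\cdot}[f{\cdot}[c]], z]) &
\end{array}$$

(b)

$$\begin{array}{ll}
\exists x.\ \forall y.\ \exists z.\ \forall w.\ (x = w \land y = z) & \mathsf{in} \\
\exists x.\ \forall y.\ \exists z.\ \forall w.\ (a{\cdot}[x, y] \triangleright a{\cdot}[w, z]) & {\triangleright}\forall \\
\exists x.\ \forall y.\ \exists z.\ (a{\cdot}[x, y] \triangleright (\forall x.\ a{\cdot}[x, z])) & {\triangleright}\exists \\
\exists x.\ \forall y.\ (a{\cdot}[x, y] \triangleright (\exists y.\ \forall x.\ a{\cdot}[x, y])) & \exists{\triangleright} \\
\exists x.\ ((\exists y.\ a{\cdot}[x, y]) \triangleright (\exists y.\ \forall x.\ a{\cdot}[x, y])) & \forall{\triangleright} \\
(\forall x.\ \exists y.\ a{\cdot}[x, y]) \triangleright (\exists y.\ \forall x.\ a{\cdot}[x, y]) & {\triangleright} \\
(\forall x.\ \exists y.\ a{\cdot}[x, y]) \supset (\exists y.\ \forall x.\ a{\cdot}[x, y]) &
\end{array}$$


Fig. 5. Two example Lni derivations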


3 Incorporating Arity-Typed λ-Terms


To make the calculus Lni of the previous section suitable to host a type theory 
as an object language, we will need to generalize from first-order terms to 
general A-terms. We will follow a standard technique known variously as higher- 
order abstract syntax (HOAS) [12] or A-tree syntax [7] that treats the pure 
A-calculus—together with afn-equality as its equational theory—to represent 
object languages. To keep things computable, we will use simply typed A-terms 
with only one basic type, which is sometimes known as arity typing. Arity types 
(a, 8,...) and terms (s,t,...) have the following grammar. 


α, β, ... ::= * | α → β        h ::= x | k        s, t, ... ::= h·[s₁, ..., sₙ] | λx:α. t


where x,y,... range over variables, and sans-serif identifiers such as k range over 
term constants. For formulas, we also change the quantifiers Qx. F to their arity
typed forms Qx:α. F, where Q ∈ {∀, ∃}.

We keep λ-terms in canonical spine form, where the head (h) of an application
is identified and separated; in more usual notation, h·[s₁, ..., sₙ] would be written
as the iterated application (⋯(h s₁) ⋯ sₙ). The definition of substitution, [t/x]s,
must be modified to retain spine forms, which is usually done by removing redexes
on the fly; for example (using @ as an auxiliary operation):


[t/x]k = k·[]    [t/x]x = t    [t/x]y = y·[]   (where x and y are different)

[t/x](λy:α. s) = λy:α. [t/x]s
[t/x](h·[s₁, ..., sₙ]) = ([t/x]h) @ [[t/x]s₁, ..., [t/x]sₙ]
(λx:α. s) @ [t₁, t₂, ..., tₙ] = ([t₁/x]s) @ [t₂, ..., tₙ]
(h·[s₁, ..., sₘ]) @ [t₁, ..., tₙ] = h·[s₁, ..., sₘ, t₁, ..., tₙ]
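
The substitution equations can be read as a small recursive program. The following Python sketch is illustrative only and is not taken from any implementation of the calculus: the names `Spine`, `Lam`, `subst` and `apply_spine` are hypothetical, arity annotations on binders are dropped, and bound variables are assumed to be named apart from the free variables of the substituted term, so no renaming is performed. The base case s @ [] = s is an assumption added so that the recursion bottoms out.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Spine:
    """h·[s1, ..., sn]: a head (variable or constant name) applied to arguments."""
    head: str
    args: Tuple["Term", ...] = ()


@dataclass(frozen=True)
class Lam:
    """λx. t (the arity annotation on x is omitted in this sketch)."""
    var: str
    body: "Term"


Term = Union[Spine, Lam]


def subst(t: Term, x: str, s: Term) -> Term:
    """Compute [t/x]s, contracting any redex created along the way."""
    if isinstance(s, Lam):
        if s.var == x:              # x is shadowed; nothing to substitute below
            return s
        return Lam(s.var, subst(t, x, s.body))
    new_args = tuple(subst(t, x, a) for a in s.args)
    new_head = t if s.head == x else Spine(s.head)
    return apply_spine(new_head, new_args)


def apply_spine(s: Term, args: Tuple[Term, ...]) -> Term:
    """Compute s @ args."""
    if not args:                    # assumed base case: s @ [] = s
        return s
    if isinstance(s, Lam):          # (λx. s') @ [t1, t2, ...] = ([t1/x]s') @ [t2, ...]
        return apply_spine(subst(args[0], s.var, s.body), args[1:])
    return Spine(s.head, s.args + args)   # extend an existing spine


c = Spine("c")
# [(λy. f·[y]) / z] (z·[c]) is contracted on the fly to f·[c]
result = subst(Lam("y", Spine("f", (Spine("y"),))), "z", Spine("z", (c,)))
```

On well-typed terms the recursion terminates; the sketch does not check typing.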

Most of the inference rules of system Lni generalize easily to this setting. The
immediate differences will be with respect to the simplification rules. For the inst
rule, we use a variant judgement C{t : α} to mean that the λ-term t is well-typed
at type α based on the type assumptions of its free variables that are bound
in the scope of the hole in C{}. It is possible to view this judgement as being
defined by inference rules; for instance (for Q ∈ {∀, ∃}):


                                 C{∀x:α. (t : β)}
---------------------         ----------------------
C{Qx:α. C'{x : α}}            C{(λx:α. t) : α → β}

C{h : α₁ → ⋯ → αₙ → β}    C{sᵢ : αᵢ}
--------------------------------------
      C{(h·[s₁, ..., sₙ]) : β}


The rules refl and cong of Lni are replaced with: 


   C{s = t}                      C{∀x:α. (s = t)}
--------------- cong       -------------------------- abs
C{h·s = h·t}               C{(λx:α. s) = (λx:α. t)}

C{(λx:α. h·[s₁, ..., sₙ, x]) = (λx:α. t)}
------------------------------------------ η-exp (and its symm. variant)
     C{h·[s₁, ..., sₙ] = (λx:α. t)}


Definition 11 (System Lniλ). The system Lniλ is a modification of Lni with
the V rules, cong, abs, η-exp, and in above.


Theorem 12 (Completeness of Lniλ). For any formula F in the language
of first-order logic over λ-terms but without any occurrence of =, if ⊢ F in a
complete sequent calculus then T 424 F. 


Proof (Sketch). Once again, this is a straightforward extension of the proof
of theorem 9. Since there are no occurrences of = in F, and in particular no
occurrence of it in a negatively signed context, the rules cong, abs and η-exp are
sufficient to implement αβη-equivalence. □


4 Application: Embedding Intuitionistic Type Theories 


The first-order language over arity-typed λ-terms of the previous section has
enough expressive power for a complete encoding of any pure type system [6,15].
To keep things simple in this paper, we will demonstrate the case for LF (aka
λΠ) using the simple embedding from [15]. Expressions in LF belong to one of
the following three syntactic categories: kinds, types, or terms.


K ::= type | Πx:A. K                       (kinds)
A, B, ... ::= a M₁ ⋯ Mₙ | Πx:A. B         (types)
M, N, ... ::= x | k | λx:A. M | M N       (terms)


The LF type system is formally specified using inference rules in [9] and will not 
be repeated here. Instead, we will directly present a complete encoding of LF 
expressions using the language of Lniλ.

The encoding proceeds in two steps. First, we transform the dependently
typed terms of LF into their simply typed forms, normalizing them as necessary. 
However, since LF terms can mention their types, we simultaneously transform LF 
types into simple types. This transformation erases not just the type dependencies 
but also the identities of the types by collapsing all of them to the same base 
type *. 


Definition 13. The forgetful map φ specified below transforms LF terms into
Lniλ λ-terms and LF types and kinds into Lniλ types.


φ(k) = k·[]                     φ(a M₁ ⋯ Mₙ) = *
φ(x) = x·[]                     φ(Πx:A. B) = φ(A) → φ(B)
φ(M N) = φ(M) @ [φ(N)]          φ(Πx:A. K) = φ(A) → φ(K)


The second stage of the transformation recovers the information that was lost
in the φ map by means of a single atomic predicate, has. Using this we define a
mapping ⟦·⟧ that transforms types and kinds to formulas in such a way that if
M : A holds then ⟦A⟧ φ(M) is true.


Definition 14. The mapping ⟦·⟧ transforms an LF type or kind and a Lniλ λ-term
into a Lniλ formula, specified recursively as follows.


⟦a M₁ ⋯ Mₙ⟧ m = has·[m, a·[φ(M₁), ..., φ(Mₙ)]]
⟦type⟧ m = has·[m, type]
⟦Πx:A. J⟧ m = ∀x:φ(A). ⟦A⟧ x ⊃ ⟦J⟧ (m @ [x])


(where J can be an LF type or kind).
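
To make the two-stage encoding concrete, here is a small Python sketch that renders φ and ⟦·⟧ as strings. It is a simplification under stated assumptions rather than the embedding of [15] itself: it covers only LF types (no kinds), the term arguments of atomic types are supplied already erased, the term m is assumed to be a spine so that m @ [x] simply appends x, and the names `Atom`, `Pi`, `erase` and `encode` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atom:
    """Atomic LF type a M1 ... Mn; each Mi is given as an already erased term."""
    name: str
    args: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Pi:
    """Dependent product Πx:A. B."""
    var: str
    dom: "LFType"
    cod: "LFType"


LFType = Union[Atom, Pi]


def spine(head: str, args: Tuple[str, ...]) -> str:
    """Render h·[s1, ..., sn]; an empty spine is shown as the bare head."""
    return head if not args else f"{head}·[{', '.join(args)}]"


def erase(ty: LFType) -> str:
    """φ on types: atomic types collapse to *, Π becomes an arrow."""
    if isinstance(ty, Atom):
        return "*"
    left = erase(ty.dom)
    if isinstance(ty.dom, Pi):      # arrows associate to the right
        left = f"({left})"
    return f"{left} → {erase(ty.cod)}"


def encode(ty: LFType, head: str, args: Tuple[str, ...] = ()) -> str:
    """⟦ty⟧ m, where m = head·[args] is kept in spine form."""
    if isinstance(ty, Atom):
        return f"has·[{spine(head, args)}, {spine(ty.name, ty.args)}]"
    x = ty.var
    hyp = encode(ty.dom, x)         # ⟦A⟧ x
    if isinstance(ty.dom, Pi):
        hyp = f"({hyp})"
    body = encode(ty.cod, head, args + (x,))   # ⟦J⟧ (m @ [x])
    return f"∀{x}:{erase(ty.dom)}. {hyp} ⊃ {body}"


# The type Πx:a. b x, encoded at the term z
z_type = Pi("x", Atom("a"), Atom("b", ("x",)))
formula = encode(z_type, "z")
```

Here `formula` is the string `∀x:*. has·[x, a] ⊃ has·[z·[x], b·[x]]`.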


Proposition 15 (Completeness [15]). If the judgement x₁:J₁, ..., xₙ:Jₙ ⊢
M : A is derivable in LF [9], then the following formula is provable in Lniλ:
∀x₁:φ(J₁). ⟦J₁⟧ (x₁·[]) ⊃ ⋯ ⊃ ∀xₙ:φ(Jₙ). ⟦Jₙ⟧ (xₙ·[]) ⊃ ⟦A⟧ φ(M). □


The converse of proposition 15 does not necessarily hold, since the forgetful
map φ is injective, not surjective.³ In particular, since the encoding of atomic
types forgets the term arguments, we have that φ(λx:A₁. s) = φ(λx:A₂. s) if
φ(A₁) = φ(A₂); however, the latter does not guarantee that A₁ = A₂. Thus,
⟦Πx:A₁. B⟧ φ(λx:A₂. s) may hold even when A₁ ≠ A₂. To guarantee surjectivity,
we must use the canonical LF variant of the LF type theory where the type
ascription on λ is omitted and the type system is made bidirectional [19]; this
will guarantee that only Π-types will ascribe types to bound variables, removing
the issue highlighted above.


3 This issue, pointed out in [16], is a mistake in earlier papers such as [6,15]. 
[Figure: derivation tree with conclusion
∀u. has·[u, a] ⊃ ∀z. (∀x. has·[x, a] ⊃ has·[z·[x], b·[x]]) ⊃ ∃k. has·[k, b·[u]]]


Fig. 6. A Lniλ derivation of an embedded LF type (example 16). Some type ascriptions
are elided, and doubled lines denote simplifications.


Example 16. Consider the following LF type A = Πu:a. Πz:(Πx:a. b x). b u. By
definition 14, we have:


⟦A⟧ k = ∀u:*. has·[u, a] ⊃
        ∀z:* → *. (∀x:*. has·[x, a] ⊃ has·[z·[x], b·[x]]) ⊃
        has·[k, b·[u]].


Fig. 6 has an example Lniλ derivation of this formula where k is existentially
quantified. As usual, highlights are used to indicate the two links the user
indicated in the two ⊃ rules. The derivation can be completed with the instantiation
[z·[u]/k]; this means that the LF type A is inhabited by some LF term M for
which φ(M) = z·[u].


Note that the fact that we have not discovered an LF term for k using the Lniλ
derivation is not a problem. Given a Lniλ term k for which ⟦A⟧ k is derivable, it
is possible to find a term M for which φ(M) = k and M : A holds in LF. One
way to do this would be to use bidirectional type checking [14,19] to recreate,
deterministically, the missing LF types.

While the encoding of LF in Lniλ suffices to implement the proof by linking
technique, it is a leaky encoding. As the derivation in fig. 6 proceeds, the conjecture
resembles the image of the ⟦·⟧ map less and less; in particular, the conjecture starts
to accumulate things that are not fundamentally present in the LF type system,
such as term equations, conjunctions, and existential quantifiers. The purported
novice user mentioned in the introduction thus needs to be familiar with at least
two languages: LF and (a somewhat esoteric variant of) first-order logic. One way
to improve matters would be to try to define the linking procedure directly on
the LF type system, but this example seems to indicate that the LF language is
not expressive enough to capture all the structures that will occur when resolving
a link. At the very least, it seems that some kind of pairing construct (i.e.,
Σ-types) is essential. Moreover, to capture free floating has assumptions, the
language of LF might need to be extended further with judgemental expressions
of the form (M : A).


5 Conclusion and Future Directions 


We have presented a formal system of proof by linking for intuitionistic logic 
and a derived system for the dependent type theory LF. We are currently in the 
process of implementing this system as a variant of the Profound tool, which was 
initially developed for classical linear logic in [4]. 

In order for this system to be usable in a general purpose interactive theorem 
prover based on first-order logic (such as Abella [2]) or dependent type theory 
(such as Twelf [13]), the most important missing ingredient is support for inductive 
definitions and reasoning by induction. The first step in a proof by structural 
induction is to indicate which assumption(s) will drive the analysis, which is 
closer to a pointing than a linking. Thus, proof by linking and pointing will need 
to co-exist. 

A further improvement that would be made as a matter of course in an 
implementation would be the use of a unification engine to remove the clutter 
of = formulas. It is worth investigating (in future work) if the linking metaphor 
can also be used for algebraic operations on terms based on =. In many systems
=-assumptions can be used to rewrite terms, which is readily incorporated into
the linking scheme: just link a term to one side of a =. We can in fact see it as
variants of the inst rule:


C{[t/x]C'{⊤}}            A{[t/x]A'{⊤}}
------------------       ------------------
C{∃x. C'{x = t}}         A{∀x. A'{x = t}}


It is worth investigating if such variants of inst can make the embedding of LF
into Lniλ less leaky.

Note that proof by linking, like proof by pointing, can easily be incorporated 
as a tactic in an existing proof system. After all, each of the inference rules of 
Lniλ is logically motivated, and can therefore be established as a certifying tactic.
The quality of the formal proof terms produced in this way will be poor since
most proof term languages are not designed for deep rewriting; indeed, the proof
term for each Lniλ inference rule may have a size that is exponential in that of
the conjecture. It is perhaps better to see proof by linking as a proof exploration 
tool for quickly testing out logical properties of a conjecture before attempting 
a traditional structured proof. In the hands of an expert user, this exploration 
mode can also help to discover useful lemmas to bridge the gap between an 
existing collection of proved theorems and a desired target theorem. 
Abstract. We present an efficient proof search procedure for Intuitionistic
Propositional Logic which involves the use of an incremental SAT-solver.
Basically, it is obtained by adding a restart operation to the system
intuit by Claessen and Rosén, thus we call our implementation
intuitR. We gain some remarkable advantages: derivations have a simple
structure; countermodels are in general small; using a standard benchmarks
suite, we outperform intuit and other state-of-the-art provers.


1 Introduction 


The intuit theorem prover by Claessen and Rosén implements an efficient
decision procedure for Intuitionistic Propositional Logic (IPL) based on a
Satisfiability Modulo Theories (SMT) approach. Given an input formula α, the
clausification module of intuit computes a sequent σ = R, X ⇒ g equivalent
to α with respect to IPL-validity, where R, X and g have a special form: R is
a set of clauses, X is a set of implications (a → b) → c, with a, b, c atoms, g
is an atom. The decision procedure at the core of intuit searches for a Kripke
model K such that at its root all the formulas in R and X are forced and g is
not forced; we call K a countermodel for σ, since it witnesses the non-validity
of σ in IPL. The search is performed via a proper variant of the DPLL(T)
procedure [12], whose top-level loop exploits an incremental SAT-solver. This leads
to a highly performant decision strategy; actually, on the basis of a standard
benchmarks suite, intuit outperforms two of the state-of-the-art provers for
IPL, namely fCube [5] and intHistGc [IT]. At first sight, the intuit decision 
procedure seems to be far away from the traditional techniques for deciding IPL 
validity; on the other hand, the in-depth investigation presented in [10] unveils 
a close and surprising connection between the intuit approach based on SMT 
and the known proof-theoretic methods. The crucial point is that the main loop 
of the decision procedure mimics a standard root-first proof search strategy for 
the sequent calculus LJT_SAT (see Fig. F), a variant of Dyckhoff’s calculus
LJT [8]. In [10] the intuit decision procedure is re-formulated so that, given a
sequent σ, it outputs either a derivation of σ in LJT_SAT or a countermodel for σ.

Here we continue this investigation to better take advantage of the interplay
between the SMT perspective and proof-theoretic methods. At first, we have
enhanced the Haskell intuit code¹ by implementing the derivation/countermodel


1 Available at https://github.com/koengit/intuit


© The Author(s) 2021 
extraction procedures discussed in [10]. We observed some unexpected
phenomena: derivations are often convoluted and contain applications of
the cut rule which cannot be trivially eliminated; countermodels in general
contain lots of redundancies. To overcome these issues, we have redesigned the
decision procedure. Differently from intuit, in the main loop we keep all the worlds
of the countermodel under construction. Whenever the generation of a new world
fails, the current model is emptied and the computation restarts with a new
iteration of the main loop. We call the obtained prover intuitR (intuit with
Restart). We gain some remarkable advantages. Firstly, the proof search
procedure has a plain and intuitive presentation, consisting of two nested loops (see the
flowchart in Fig. 3). Secondly, derivations have a linear structure, formalized by
the calculus C→ in Fig. 1: basically, a derivation in C→ is a cut-free derivation in
LJT_SAT having only one branch. Thirdly, the countermodels obtained by intuitR
are in general smaller than the ones obtained by intuit, since restarts cross out 
redundant worlds. We have replicated the experiments in [2] (1200 benchmarks): 
as reported in the table in Fig. [9Jand in the scatter plot in Fig. intuitR has 
better performances than intuit. The intuitR implementation and other addi- 
tional material (e.g., the omitted proofs, a detailed report on experiments) can 


be downloaded at https://github.com/cfiorentini/intuitR 


2 Preliminary Notions 


Formulas, denoted by lowercase Greek letters, are built from an infinite set of
propositional variables V, the constant ⊥ and the connectives ∧, ∨, →; the
formula α ↔ β stands for (α → β) ∧ (β → α). Elements of the set V ∪ {⊥}
are called atoms and are denoted by lowercase Roman letters; uppercase Greek
letters denote sets of formulas. A (classical) interpretation M is a subset of V,
identifying the propositional variables assigned to true. By M ⊨ α we mean
that α is true in M; moreover, M ⊨ Γ iff M ⊨ α for every α ∈ Γ. We write
Γ ⊢_c α iff, for every interpretation M, M ⊨ Γ implies M ⊨ α. A formula α is
CPL-valid (valid in Classical Propositional Logic) iff ∅ ⊢_c α.

A (rooted) Kripke model for IPL (Intuitionistic Propositional Logic) is a
quadruple (W, ≤, r, ϑ) where W is a finite and non-empty set (the set of worlds),
≤ is a reflexive and transitive binary relation over W, the world r (the root of
K) is the minimum of W w.r.t. ≤, and ϑ : W → 2^V (the valuation function) is
a map obeying the persistence condition: for every pair of worlds w₁ and w₂ of
K, w₁ ≤ w₂ implies ϑ(w₁) ⊆ ϑ(w₂). The valuation ϑ is extended into a forcing
relation between worlds and formulas as follows:

w ⊩ p iff p ∈ ϑ(w), ∀p ∈ V        w ⊮ ⊥        w ⊩ α ∧ β iff w ⊩ α and w ⊩ β

w ⊩ α ∨ β iff w ⊩ α or w ⊩ β      w ⊩ α → β iff ∀w' ≥ w, w' ⊩ α implies w' ⊩ β.


By w ⊩ Γ we mean that w ⊩ α for every α ∈ Γ. A formula α is IPL-valid iff,
for every Kripke model K we have r ⊩ α (here and below r designates the root
of K). Thus, if there exists a model K such that r ⊮ α, then α is not IPL-valid;
we call K a countermodel for α, written K ⊭ α, and we say that α is
counter-satisfiable. We write Γ ⊢_i δ iff, for every model K, r ⊩ Γ implies r ⊩ δ; thus,
  R ⊢_c g                 R, A ⊢_c b     R, φ, X ⇒ g               λ = (a → b) → c ∈ X
------------ cpl₀        ------------------------------ cpl₁ λ      A ⊆ V
 R, X ⇒ g                          R, X ⇒ g                        φ = ⋀(A \ {a}) → c


Fig. 1. The sequent calculus C→; R, X ⇒ g is an r-sequent.


α is IPL-valid iff ∅ ⊢_i α. Let σ be a sequent of the form Γ ⇒ δ; σ is IPL-valid
iff Γ ⊢_i δ. By K ⊭ σ we mean that r ⊩ Γ and r ⊮ δ. Note that such a model
K witnesses that σ is not IPL-valid; we say that K is a countermodel for σ and
that σ is counter-satisfiable.
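
The forcing relation can be prototyped directly from its defining clauses. The sketch below is a hedged illustration and not part of intuit or intuitR: the tuple encoding of formulas, the function name `forces` and the two-world model are assumptions made for the example.

```python
# Formulas as nested tuples (an encoding assumed for this sketch):
#   ("var", "p"), ("bot",), ("and", A, B), ("or", A, B), ("imp", A, B)

def forces(w, f, leq, val):
    """Decide w ⊩ f, given the preorder `leq` (a set of pairs) and the valuation `val`."""
    tag = f[0]
    if tag == "var":
        return f[1] in val[w]
    if tag == "bot":
        return False
    if tag == "and":
        return forces(w, f[1], leq, val) and forces(w, f[2], leq, val)
    if tag == "or":
        return forces(w, f[1], leq, val) or forces(w, f[2], leq, val)
    if tag == "imp":
        # every world above w that forces the antecedent must force the consequent
        return all(not forces(v, f[1], leq, val) or forces(v, f[2], leq, val)
                   for v in val if (w, v) in leq)
    raise ValueError(f"unknown connective: {tag}")


# Two worlds r ≤ w; p holds only at w, so persistence is satisfied.
val = {"r": frozenset(), "w": frozenset({"p"})}
leq = {("r", "r"), ("r", "w"), ("w", "w")}

p = ("var", "p")
bot = ("bot",)
excluded_middle = ("or", p, ("imp", p, bot))          # p ∨ (p → ⊥)
double_negation = ("imp", ("imp", excluded_middle, bot), bot)

root_forces_em = forces("r", excluded_middle, leq, val)
root_forces_nn_em = forces("r", double_negation, leq, val)
```

In this model the root does not force p ∨ (p → ⊥), so the model is a countermodel for it, while the double negation of the same formula is forced at the root.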


Clausification We review the main concepts about the clausification procedure
described in [2]. Flat clauses φ and implication clauses λ are defined as


φ ::= ⋀A₁ → ⋁A₂ | ⋁A₂        ∅ ⊂ A_k ⊆ V ∪ {⊥}, for k ∈ {1, 2}
λ ::= (a → b) → c              a ∈ V, {b, c} ⊆ V ∪ {⊥}


where ⋀A₁ and ⋁A₂ denote the conjunction and the disjunction of the atoms
in A₁ and A₂ respectively (⋀{a} = ⋁{a} = a). Henceforth, ⋀∅ → ⋁A₂ must
be read as ⋁A₂; moreover, R, R₁, ... denote sets of flat clauses; X, X₁, ... sets
of implication clauses; A, A₁, ... sets of atoms. The intuit procedure relies on
the following property (see Lemma 2 in [10]):


Lemma 1. For every set of flat clauses R and every atom g, R ⊢_i g iff R ⊢_c g.


In the decision procedure, flat clauses are actively used only in classical
reasoning. A pair (R, X) is →-closed iff, for every (a → b) → c ∈ X, b → c ∈ R. An
r-sequent (reduced sequent) is a sequent Γ ⇒ g where g is an atom, Γ = R ∪ X
and (R, X) is →-closed. Given a formula α, the clausification procedure yields a
triple (R, X, g) such that R, X ⇒ g is an r-sequent and:


(1) ⊢_i α iff R, X ⊢_i g;    (2) K ⊭ R, X ⇒ g implies K ⊭ α, for every K.²


Thus, IPL-validity of formulas can be reduced to IPL-validity of r-sequents. 
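
The →-closure condition on (R, X) is easy to state operationally. In the following sketch, which is only an illustration, a flat clause ⋀A₁ → ⋁A₂ is represented as a pair of frozensets (A₁, A₂) and an implication clause (a → b) → c as a triple (a, b, c); this representation and the function names are assumptions of the example, not the data structures of intuit.

```python
BOT = "⊥"   # the constant falsum, used as an ordinary atom name


def arrow_close(R, X):
    """Return R extended with b → c for every implication clause (a → b) → c in X."""
    closed = set(R)
    for (_a, b, c) in X:
        closed.add((frozenset({b}), frozenset({c})))
    return closed


def is_arrow_closed(R, X):
    """Check that b → c is in R for every (a → b) → c in X."""
    return all((frozenset({b}), frozenset({c})) in R for (_a, b, c) in X)


X = [("a", "b", "g"), ("p", BOT, "g")]
R = {(frozenset({"a"}), frozenset({"b"}))}     # the single flat clause a → b
R_closed = arrow_close(R, X)                   # adds b → g and ⊥ → g
```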


3 The Calculus C→


The sequent calculus C→ consists of the rules cpl₀ and cpl₁ from Fig. 1. Rule
cpl₀ (axiom rule) can only be applied if the condition R ⊢_c g holds, rule cpl₁
requires that R, A ⊢_c b holds. In rule cpl₁, (a → b) → c is the main formula
and A the local assumptions; note that A is any set of propositional variables
(not necessarily containing a). Derivations are defined as usual (see e.g. [14]);


2 In [2] the clausification procedure outputs a triple (R, X, g) satisfying (1) and (2);
the →-closure of (R, X) is performed at the beginning of the decision procedure (for
every (a → b) → c ∈ X, the clause b → c is added to R).
                                 R_m ⊢_c g
                                ------------
R_{m-1}, A_{m-1} ⊢_c b_{m-1}     R_m, X ⇒ g
-------------------------------------------- λ_{m-1}
              R_{m-1}, X ⇒ g
                    ⋮
      R₁, A₁ ⊢_c b₁    R₂, X ⇒ g
      --------------------------- λ₁
R₀, A₀ ⊢_c b₀    R₁, X ⇒ g
--------------------------- λ₀
        R₀, X ⇒ g

λ_k = (a_k → b_k) → c_k ∈ X,   φ_k = ⋀(A_k \ {a_k}) → c_k,   R_{k+1} = R_k ∪ {φ_k}


Fig. 2. Derivation of R₀, X ⇒ g in C→ (0 ≤ k ≤ m−1).


by ⊢_{C→} σ we mean that there exists a derivation of the r-sequent σ in C→. In
showing derivations, we leave out rule names and we display the main formulas
of cpl₁ applications. Soundness of rule cpl₁ relies on the following property:


(a) If R, A ⊢_c b, then R, (a → b) → c ⊢_i φ, where φ = ⋀(A \ {a}) → c.


Indeed, let R, A ⊢_c b. By Lemma 1, R, A ⊢_i b, thus R, A \ {a} ⊢_i a → b.
It follows that R, (a → b) → c, A \ {a} ⊢_i c, hence R, (a → b) → c ⊢_i φ. By
Lemma 1 and (a), the soundness of C→ follows:


Proposition 1. ⊢_{C→} R, X ⇒ g implies R, X ⊢_i g.


A derivation of σ₀ = R₀, X ⇒ g has the plain form shown in Fig. 2: it
only contains the branch of sequents σ_k = R_k, X ⇒ g where the sets R_k
are increasing. Nevertheless, the design of a root-first proof search strategy for
C→ is not obvious. Let σ₀ be the r-sequent to be proved; we try to build the
derivation in Fig. 2 bottom-up by running a loop where, at each iteration
k ≥ 0, we search for a derivation of σ_k. It is convenient to first check whether
R_k ⊢_c g so that, by applying rule cpl₀, we immediately get a derivation of
σ_k. If this is not the case, we should pick an implication λ_k from X and guess
a proper set of local assumptions A_k in order to apply rule cpl₁ bottom-up.
If we followed a blind choice, the procedure would be inefficient; for instance,
the following application of rule cpl₁ triggers a non-terminating loop:

R_k, b_k ⊢_c b_k    R_k, X ⇒ g
------------------------------- λ_k
         R_k, X ⇒ g

λ_k = (a_k → b_k) → c_k ∈ X,   b_k → c_k ∈ R_k,
A_k = {b_k},   φ_k = b_k → c_k,   R_{k+1} = R_k

Instead, we pursue this strategy: we search for a countermodel for σ_k; if we
succeed, then R_k, X ⊬_i g and, being R₀ ⊆ R_k, we conclude that R₀, X ⊬_i g and
proof search ends. Otherwise, from the failure we learn the proper λ_k and A_k
to be used in the application of rule cpl₁; in the next iteration, proof search restarts
with the sequent σ_{k+1}, where R_{k+1} is obtained by adding the learned clause
φ_k to R_k. To check classical provability, we exploit a SAT-solver; each time the
solver is invoked, the set R_k has increased, thus it is advantageous to use an
incremental SAT-solver.
Countermodels Henceforth we define Kripke models by specifying the
interpretations associated with their worlds. Let W be a finite set of interpretations with
minimum M₀, namely: M₀ ⊆ M for every M ∈ W. By K(W) we denote the
Kripke model (W, ≤, M₀, ϑ) where ≤ coincides with the subset relation ⊆ and ϑ
is the identity map, thus M ⊩ p (in K(W)) iff p ∈ M. We introduce the following
realizability relation ⊳_W between W and implication clauses:


M ⊳_W (a → b) → c  iff  (a ∈ M) or (b ∈ M) or (c ∈ M) or
                        (∃M' ∈ W s.t. M ⊆ M' and a ∈ M' and b ∉ M').


By M ⊳_W X we mean that M ⊳_W λ for every λ ∈ X. Countermodels of r-sequents
can be characterized as follows:


Proposition 2. Let σ = R, X ⇒ g be an r-sequent and let W be a finite set of
interpretations with minimum M₀. Then, K(W) ⊭ σ iff:
(i) g ∉ M₀; (ii) for every M ∈ W, M ⊨ R and M ⊳_W X.
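
Proposition 2 gives a purely set-based test for countermodels, which the following Python sketch spells out. It is an illustration under assumptions: interpretations are frozensets of variable names, a flat clause ⋀A₁ → ⋁A₂ is a pair (A₁, A₂), an implication clause (a → b) → c is a triple (a, b, c), and the function names are hypothetical.

```python
BOT = "⊥"   # falsum; never a member of an interpretation


def satisfies(M, clause):
    """M ⊨ ⋀body → ⋁head for a flat clause given as the pair (body, head)."""
    body, head = clause
    return not body <= M or bool(head & M)


def realizes(M, lam, W):
    """M ⊳_W (a → b) → c, following the four disjuncts of the definition."""
    a, b, c = lam
    return (a in M or b in M or c in M
            or any(M <= N and a in N and b not in N for N in W))


def is_countermodel(W, R, X, g):
    """Check conditions (i) and (ii) of Proposition 2 for K(W) and R, X ⇒ g."""
    M0 = min(W, key=len)
    if not all(M0 <= M for M in W):
        raise ValueError("W has no minimum element")
    return g not in M0 and all(
        all(satisfies(M, cl) for cl in R) and all(realizes(M, lam, W) for lam in X)
        for M in W)


R = [(frozenset({BOT}), frozenset({"g"}))]     # ⊥ → g, so that (R, X) is →-closed
X = [("a", BOT, "g")]                          # (a → ⊥) → g
W_good = [frozenset(), frozenset({"a"})]       # root ∅ with one successor {a}
W_bad = [frozenset()]                          # the root alone does not realize X
```

With these inputs, `W_good` passes the test for the sequent R, X ⇒ g, while `W_bad` fails it because the single world ∅ does not realize (a → ⊥) → g.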


4 The Procedure proveR 


The strategy outlined in Sec. 3 is implemented by the decision procedure proveR
(prove with Restart) defined by the flowchart in Fig. 3. The call proveR(R, X, g)
returns Valid if the r-sequent σ = R, X ⇒ g is IPL-valid, CountSat otherwise;
by tracing the computation, we can build a C→-derivation of σ in the former
case, a countermodel for σ in the latter. We exploit a single incremental
SAT-solver s: clauses can be added to s but not removed; by R(s) we denote the set
of clauses stored in s. The solver s is associated with a set of propositional variables
U(s) (the universe of s); we assume that every clause φ supplied to s is built over
U(s) (namely, every variable occurring in φ belongs to U(s)). The SAT-solver is
required to support the following operations:


– newSolver()
  Create a new SAT-solver.
– addClause(s, φ)     // s is a SAT-solver, φ a flat clause built over U(s)
  Add the clause φ to s.
– satProve(s, A, g)   // s is a SAT-solver, A ⊆ U(s), g ∈ U(s) ∪ {⊥}
  Call s to decide whether R(s), A ⊢_c g (A is a set of local assumptions). The
  solver outputs one of the following answers:
  • Yes(A'): thus, A' ⊆ A and R(s), A' ⊢_c g;
  • No(M): thus, A ⊆ M ⊆ U(s) and M ⊨ R(s) and g ∉ M.
  In the former case it follows that R(s), A ⊢_c g, in the latter R(s), A ⊬_c g.

The procedure newSolver(R), defined using the primitive operations, creates
a new SAT-solver containing all the clauses in R. The computation of the call
proveR(R, X, g) consists of the following steps:


(S0) A new SAT-solver s storing all the clauses in R is created.
(S1) A loop starts (main loop) with empty W.

Input Assumptions
  R: finite set of flat clauses
  X: finite set of implication clauses
  g ∈ V ∪ {⊥}
  (R, X) is →-closed

Output Properties
  Valid implies R, X ⊢_i g
  CountSat implies R, X ⊬_i g

[Flowchart: s ← newSolver(R) (S0); W ← ∅ (S1); satProve(s, ∅, g) (S2);
W ← W ∪ {M} (S3); select (w, λ) s.t. w ∈ W, λ ∈ X, w ⋫_W λ (S4);
satProve(s, w ∪ {a}, b) (S5); φ ← ⋀(A \ {a}) → c, addClause(s, φ) (S6).
W: set of interpretations; φ: flat clause (learned clause).]


Fig. 3. Computation of proveR(R, X, g).


(S2) The SAT-solver s is called to check whether R(s) ⊢_c g. If the answer
     is Yes(∅), the computation stops yielding Valid. Otherwise, the output is
     No(M) and the computation continues at Step (S3).

(S3) A loop starts (inner loop) by adding the interpretation M computed at
     Step (S2) to the set W (thus, W = {M}).

(S4) We have to select a pair (w, λ) such that w ∈ W, λ ∈ X and w ⋫_W λ. If such
     a pair does not exist, the procedure ends with output CountSat. Otherwise,
     the computation continues at Step (S5).

(S5) Let (w, (a → b) → c) be the pair selected at Step (S4). The SAT-solver s is
     called to check whether R(s), w, a ⊢_c b. If the result is No(M), then a new
     iteration of the inner loop is performed where M is added to W. Otherwise,
     the answer is Yes(A) and the computation continues at Step (S6); we call A
     the learned assumptions and (w, (a → b) → c) the learned pair.

(S6) The clause φ = ⋀(A \ {a}) → c (the learned clause) is added to the solver s
     and the computation restarts from Step (S1) with a new iteration of the main loop.
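
Steps (S0) to (S6) can be prototyped in a few dozen lines. The Python sketch below is not the intuitR implementation: the incremental SAT-solver is replaced by a brute-force model search (`find_model`), the answer Yes(A') is obtained by greedily shrinking the assumptions, and the data representation (flat clauses as pairs of frozensets, implication clauses as triples, the names `sat_prove`, `realizes`, `prove_r`) is an assumption of the example. It is only meant for very small inputs.

```python
from itertools import combinations

BOT = "⊥"   # falsum; never a member of an interpretation


def find_model(R, U, A, g):
    """Return some M with A ⊆ M ⊆ U, M ⊨ R and g ∉ M, or None (brute force)."""
    if g in A:
        return None
    free = sorted(U - A - {g})
    for n in range(len(free) + 1):
        for extra in combinations(free, n):
            M = frozenset(A) | frozenset(extra)
            if all(not body <= M or head & M for body, head in R):
                return M
    return None


def sat_prove(R, U, A, g):
    """Stand-in for satProve: ('No', M), or ('Yes', A') with A' ⊆ A."""
    M = find_model(R, U, A, g)
    if M is not None:
        return "No", M
    core = set(A)
    for x in sorted(A):                       # greedily drop unneeded assumptions
        if find_model(R, U, frozenset(core - {x}), g) is None:
            core.discard(x)
    return "Yes", frozenset(core)


def realizes(M, lam, W):
    """M ⊳_W (a → b) → c."""
    a, b, c = lam
    return (a in M or b in M or c in M
            or any(M <= N and a in N and b not in N for N in W))


def prove_r(R, X, g):
    R = set(R)                                                      # (S0)
    U = {p for body, head in R for p in body | head}
    U |= {p for lam in X for p in lam} | {g}
    U = frozenset(U - {BOT})
    while True:                                                     # (S1) main loop
        answer, M = sat_prove(R, U, frozenset(), g)                 # (S2)
        if answer == "Yes":
            return "Valid", None
        W = [M]                                                     # (S3)
        while True:                                                 # inner loop
            pair = next(((w, lam) for w in W for lam in X
                         if not realizes(w, lam, W)), None)         # (S4)
            if pair is None:
                return "CountSat", W
            w, (a, b, c) = pair
            answer, result = sat_prove(R, U, w | {a}, b)            # (S5)
            if answer == "No":
                W.append(result)                                    # extend W
            else:
                R.add((frozenset(result - {a}), frozenset({c})))    # (S6) learn, restart
                break


# a → b, b → g, (a → b) → g ⇒ g is IPL-valid
valid_run = prove_r({(frozenset({"a"}), frozenset({"b"})),
                     (frozenset({"b"}), frozenset({"g"}))},
                    [("a", "b", "g")], "g")
# ⊥ → g, (a → ⊥) → g ⇒ g is counter-satisfiable
verdict, worlds = prove_r({(frozenset({BOT}), frozenset({"g"}))},
                          [("a", BOT, "g")], "g")
```

In the second run the returned list of worlds is the set W at the moment the procedure stops, from which the countermodel K(W) is read off.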

Note that during the computation no new variables are created, thus U(s) can
be defined as the set of propositional variables occurring in R ∪ X ∪ {g}. We
show that the call proveR(R, X, g) is correct, namely: if R, X, g match the Input
Assumptions, then the Output Properties hold (see Fig. 3). We stipulate that:


– R_k denotes the set R(s) at the beginning of iteration k of the main loop;
– φ_k denotes the clause learned at iteration k of the main loop;
– W_{k,j} denotes the set W at iteration k of the main loop and just after
  Step (S3) of iteration j of the inner loop;
– ∼_c denotes classical equivalence, namely: α ∼_c β iff ⊢_c α ↔ β.


We prove some properties about the computation of proveR(R, X, g).


(P1) Let k, j ≥ 0 be such that W_{k,j} is defined. Then:
     (i) The set W_{k,j} has a minimum element M₀ and g ∉ M₀.
     (ii) For every M ∈ W_{k,j}, M ⊨ R_k.
     (iii) If W_{k,j+1} is defined, then W_{k,j} ⊂ W_{k,j+1}.
(P2) For every 0 ≤ h < k such that φ_k is defined, φ_h ≁_c φ_k.


Let W_{k,0} = {M}; one can easily check that, setting M₀ = M, (i) holds. Point (ii)
follows by the fact that each M in W_{k,j} comes from an answer No(M), thus
M ⊨ R_k. Let W_{k,j+1} be defined and let W_{k,j+1} = W_{k,j} ∪ {M}, with M computed
at Step (S5): there is w ∈ W_{k,j} and λ = (a → b) → c ∈ X such that w ⋫_{W_{k,j}} λ
and w ∪ {a} ⊆ M and b ∉ M. We cannot have M ∈ W_{k,j}, otherwise, since
w ⊆ M and a ∈ M and b ∉ M, we would get w ⊳_{W_{k,j}} λ, a contradiction. Thus
M ∉ W_{k,j}, and this proves (iii).

Let 0 ≤ h < k be such that φ_k is defined, let (w_k, λ_k = (a_k → b_k) → c_k)
and A_k be the pair and the assumptions learned at iteration k respectively;
note that A_k ⊆ w_k ∪ {a_k}. Since R_h ∪ {φ_h} = R_{h+1} ⊆ R_k, we have φ_h ∈ R_k;
by (P1)(ii) it holds that w_k ⊨ R_k, hence w_k ⊨ φ_h. We show that w_k ⊭ φ_k, and
this proves (P2). Since (w_k, λ_k) has been selected at Step (S4), c_k ∉ w_k; by the
fact that φ_k = ⋀(A_k \ {a_k}) → c_k and A_k \ {a_k} ⊆ w_k, we conclude w_k ⊭ φ_k.

Exploiting the above properties, we prove the correctness of proveR, also 
showing how to extract derivations and countermodels from computations. 


Proposition 3. The call proveR(R,X,g) is correct. 


Proof. We start by proving that the computation never diverges. By (P2), the
learned clauses φ_k are pairwise not classically equivalent; since each φ_k is built
over the finite set U(s), at most 2^|U(s)| such clauses can be generated, and this
proves the termination of the main loop. Since every interpretation M in W is
a subset of U(s), by (P1)(iii) the termination of the inner loop follows.

Let σ = R, X ⇒ g. If proveR(R, X, g) returns CountSat, then the
computation ends at Step (S4) since no pair (w, λ) can be selected. By (P1), the
current set W satisfies the assumptions (i), (ii) of Prop. 2; accordingly, K(W) is
a countermodel for σ, thus R, X ⊬_i g. If proveR(R, X, g) outputs Valid, then
there exists m ≥ 0 such that, at Step (S2) of iteration m of the main loop, the
SAT-solver yields Yes(∅), hence R_m ⊢_c g. For every iteration k in 0 ... m−1 of
the main loop, let (w_k, λ_k = (a_k → b_k) → c_k) be the learned pair and A_k the
learned assumptions (thus, R_k, A_k ⊢_c b_k). We can apply rule cpl₁ as follows:


R_k, A_k ⊢_c b_k    R_{k+1}, X ⇒ g
----------------------------------- λ_k       φ_k = ⋀(A_k \ {a_k}) → c_k
           R_k, X ⇒ g                         R₀ = R,  R_{k+1} = R_k ∪ {φ_k}


Accordingly, we can build the derivation of R, X ⇒ g displayed in Fig. 2 and,
by Prop. 1, we conclude R, X ⊢_i g.


As a corollary, we get the completeness of the calculus C→:
Proposition 4. For every r-sequent σ = R, X ⇒ g, ⊢_{C→} σ iff R, X ⊢_i g.


We give two examples of computations using formulas from the ILTP
(Intuitionistic Logic Theorem Proving) library [13].


Example 1. Let χ be the first instance of problem class SYJ201 from the ILTP
library [13], where η_ij = p_i ↔ p_j and γ = p₁ ∧ p₂ ∧ p₃:


χ = ((η₁₂ → γ) ∧ (η₂₃ → γ) ∧ (η₃₁ → γ)) → γ


The clausification of χ yields the triple (R₀, X, g), where X contains the
implication clauses λ₀, ..., λ₅ defined in Fig. 4 and R₀ the following 17 clauses (we
mark by a tilde the fresh variables introduced during clausification):

Po > Da, P3 > p2, p3—>p3, parpi, paps, Ps > ps, Pps — Pa, 

Pi ^A D2 > Po, Be \ Pr > Ps, Po \ Pio > Ps, pı A p2 ^ p3 > J, 

pı > po, pı > Po, p2 > pi, p2 > P7, p3 — Pe, p3 > Pio. 


The trace of the computation of proveR(R₀, X, g) is shown in Fig. 4. Each
row displays the validity tests performed by the SAT-solver and the computed
answers. If the result is No(_), the last two columns show the worlds w_k in the
current set W and, for each w_k, the list of λ such that w_k ⋫_W λ; the pair selected
at Step (S4) is underlined. For instance, after call (0) we have W = {w₀} and
w₀ ⋫_W λ_k for every 0 ≤ k ≤ 5; the selected pair is (w₀, λ₀). After call (1), the set
W is updated by adding the world w₁ and w₁ ⋫_W λ₃, w₁ ⋫_W λ₅ and w₀ ⋫_W λ_k for
every 2 ≤ k ≤ 5 (since w₁ ∈ W, we get w₀ ⊳_W λ₀); the selected pair is (w₁, λ₃).
Whenever the SAT-solver outputs Yes(A), we display the learned clause φ_k. The
SAT-solver is invoked 15 times and there are 6 restarts. Fig. 4 also shows the
derivation of R₀, X ⇒ g extracted from the computation.


Example 2. Let ψ be the second instance of problem class SYJ207 from the ILTP
library [13], where η_ij = p_i ↔ p_j and γ = p₁ ∧ p₂ ∧ p₃ ∧ p₄:


ψ = ((η₁₂ → γ) ∧ (η₂₃ → γ) ∧ (η₃₄ → γ) ∧ (η₄₁ → γ)) → (p₀ ∨ ¬p₀ ∨ γ)


3 With intuit, the set R₀ consists of the 11 clauses in the first two rows; the remaining
6 clauses are added when the →-closure of (R₀, X) is performed (see footnote 2).
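$\lambda_0 = (p_3 \to p_2) \to \tilde{p}_7$    $\lambda_1 = (p_3 \to p_1) \to \tilde{p}_9$    $\lambda_2 = (p_2 \to p_3) \to \tilde{p}_8$
$\lambda_3 = (p_2 \to p_1) \to \tilde{p}_6$    $\lambda_4 = (p_1 \to p_3) \to \tilde{p}_{10}$    $\lambda_5 = (p_1 \to p_2) \to \tilde{p}_5$

$w_0 = \emptyset$    $w_1 = \{p_3, \tilde{p}_8, \tilde{p}_{10}\}$    $w_2 = \{p_2, \tilde{p}_5, \tilde{p}_7, \tilde{p}_{10}\}$    $w_3 = \{p_3, \tilde{p}_6, \tilde{p}_8, \tilde{p}_{10}\}$
$w_4 = \{p_1, \tilde{p}_6, \tilde{p}_8, \tilde{p}_9\}$    $w_5 = \{\tilde{p}_5, \tilde{p}_7, \tilde{p}_9\}$    $w_6 = w_5 \cup \{p_2\}$    $w_7 = \{p_1, \tilde{p}_6, \tilde{p}_7, \tilde{p}_9\}$

        | Call | @SAT                                      | Answer                         | $w$: $\lambda$ s.t. $w \ntriangleright_W \lambda$ | Learned clause
Start   | (0)  | $R_0 \vdash_c \tilde{g}$                  | No($w_0$)                      | $w_0$: $\lambda_0, \dots, \lambda_5$ |
        | (1)  | $R_0, w_0, p_3 \vdash_c p_2$              | No($w_1$)                      | $w_1$: $\lambda_3, \lambda_5$; $w_0$: $\lambda_2, \dots, \lambda_5$ |
        | (2)  | $R_0, w_1, p_2 \vdash_c p_1$              | Yes($\{p_2, \tilde{p}_8\}$)    |  | $\varphi_0 = \tilde{p}_8 \to \tilde{p}_6$
Rest 1  | (3)  | $R_1 \vdash_c \tilde{g}$                  | No($w_2$)                      | $w_2$: $\lambda_1$ |
        | (4)  | $R_1, w_2, p_3 \vdash_c p_1$              | Yes($\{p_3, \tilde{p}_5\}$)    |  | $\varphi_1 = \tilde{p}_5 \to \tilde{p}_9$
Rest 2  | (5)  | $R_2 \vdash_c \tilde{g}$                  | No($w_3$)                      | $w_3$: $\lambda_5$ |
        | (6)  | $R_2, w_3, p_1 \vdash_c p_2$              | Yes($\{p_1, \tilde{p}_{10}\}$) |  | $\varphi_2 = \tilde{p}_{10} \to \tilde{p}_5$
Rest 3  | (7)  | $R_3 \vdash_c \tilde{g}$                  | No($w_4$)                      | $w_4$: $\lambda_0$ |
        | (8)  | $R_3, w_4, p_3 \vdash_c p_2$              | Yes($\{p_3\}$)                 |  | $\varphi_3 = \tilde{p}_7$
Rest 4  | (9)  | $R_4 \vdash_c \tilde{g}$                  | No($w_5$)                      | $w_5$: $\lambda_2, \lambda_3, \lambda_4$ |
        | (10) | $R_4, w_5, p_2 \vdash_c p_3$              | No($w_6$)                      | $w_6$: $\lambda_4$; $w_5$: $\lambda_4$ |
        | (11) | $R_4, w_6, p_1 \vdash_c p_3$              | Yes($\{p_1, \tilde{p}_5\}$)    |  | $\varphi_4 = \tilde{p}_5 \to \tilde{p}_{10}$
Rest 5  | (12) | $R_5 \vdash_c \tilde{g}$                  | No($w_7$)                      | $w_7$: $\lambda_2$ |
        | (13) | $R_5, w_7, p_2 \vdash_c p_3$              | Yes($\{p_2\}$)                 |  | $\varphi_5 = \tilde{p}_8$
Rest 6  | (14) | $R_6 \vdash_c \tilde{g}$                  | Yes($\emptyset$)               | Valid |

Derivation of $R_0, X \Rightarrow \tilde{g}$, listed from the topmost inference down to the root
(the implication clause used at each step is shown in brackets):

$R_6 \vdash_c \tilde{g}$  yields  $R_6, X \Rightarrow \tilde{g}$
$R_5, p_2 \vdash_c p_3$  and  $R_6, X \Rightarrow \tilde{g}$  yield  $R_5, X \Rightarrow \tilde{g}$  [$\lambda_2$]
$R_4, p_1, \tilde{p}_5 \vdash_c p_3$  and  $R_5, X \Rightarrow \tilde{g}$  yield  $R_4, X \Rightarrow \tilde{g}$  [$\lambda_4$]
$R_3, p_3 \vdash_c p_2$  and  $R_4, X \Rightarrow \tilde{g}$  yield  $R_3, X \Rightarrow \tilde{g}$  [$\lambda_0$]
$R_2, p_1, \tilde{p}_{10} \vdash_c p_2$  and  $R_3, X \Rightarrow \tilde{g}$  yield  $R_2, X \Rightarrow \tilde{g}$  [$\lambda_5$]
$R_1, p_3, \tilde{p}_5 \vdash_c p_1$  and  $R_2, X \Rightarrow \tilde{g}$  yield  $R_1, X \Rightarrow \tilde{g}$  [$\lambda_1$]
$R_0, p_2, \tilde{p}_8 \vdash_c p_1$  and  $R_1, X \Rightarrow \tilde{g}$  yield  $R_0, X \Rightarrow \tilde{g}$  [$\lambda_3$]


Fig. 4. Computation of proveR($R_0$, $X$, $\tilde{g}$), see Ex. 1.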
ào = (pa > p3) > Pu = = (p4 > pi) > fis -A2 = (p3 > pa) > Pro
A3 = (p3 > p2) > ps Aa = (p2 > ps) > pr As = (p2 > pi) > Bo 
às = (pı > pa) > Pa Ar = (pı > p2) > Pı às = (po > L) > Gg 


wo = 0 wi = {p4, Pio, Pia} we = {ps, Pr, Pii, Pia} ws = {pa, ps, Pio, Pia} 
wa = ws U {p2, Pı} ws = waU{po,g} we = {P1, P2, Ps, Pio, Piz} 
w7 = {p4, Pi, Ps, Pio, Pia} ws = wrU{p2} wo = w7U {po,g} 


@SAT Answer WwW AÀ s.t. WZ ywy AÀ 
Start (0) Ro Ke g No(wo) wo Noss À8 
(1) Ro, wo, pa FÈ ps No(w3) WL As, A4, À5, A7,A8 
wo A2, EER Às 
(2) Ro, w1, ps Fe pe Yes({p3, P10}) po = Pio > Ps 
Rest 1 (3) Ri He 9 No(wə2) we A1,A5,A7;A8 
(4) Ri,wa,pa He pi Yes({pa, B11 }) pı = Pir > P13 
Rest 2 (5) Ro HÈ g No(w3) w3 Na; As, À7, À8 
(6) Ro, w3, p2 Fé ps No(wa) wa As 
w3 Az, A8 
(7) R2, w4, po H? le No(ws) W5 ff) 
w4 o 
w3 Az 
(8) Ro, ws,pi Fe p2 Yes({pi, Pi4}) p2 = pia > pi 
Rest 3 (9) Ry = g No(we) we Ao, Aa, As 
(10) R3,we,pa Fé ps Yes({p4, P13}) p3 = pis > Pi 
Rest 4 (11) Ra H g No(w7) wr A4, A556 
(12) Ra, w7, p2 te ps No(ws) ws As 
W7 As 
(13) Ra, ws, po Fe L No(wo) wg Ø 
CountSat ws i) 
w7 ) 

[Diagrams of two countermodels: $K(\{w_7, w_8, w_9\})$, built by proveR (bottom left), and the
countermodel generated by our implementation of intuit.]


Fig. 5. Computation of proveR($R_0$, $X$, $\tilde{g}$), see Ex. 2.
1  procedure prove(R, X, g)
2    // Same Input Assumptions and Output Properties as for intuitR
3    s ← newSolver(R);  τ ← prAux(X, ∅, g)
4    if τ = Yes(∅) then return Valid else return CountSat
5  procedure prAux(X, A, q)
6    // Output: Yes(A') or No(M), where A' ⊆ A and M ⊇ A
7    τ0 ← satProve(s, A, q)
8    if τ0 = Yes(A') then return Yes(A')
9    else // τ0 = No(M)
10     for λ = (a → b) → c ∈ X s.t. a ∉ M and b ∉ M and c ∉ M do
11       τ1 ← prAux(X \ {λ}, M ∪ {a}, b)
12       if τ1 = Yes(A') then
13         φ ← ⋀(A' \ {a}) → c;  addClause(s, φ)
14         return prAux(X, A, q)
15     return No(M)
16 end


Fig. 6. The prove procedure of intuit [2,10].
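
To make the control flow of Fig. 6 concrete, here is a minimal executable sketch in Python. It is not the intuit implementation: the incremental SAT-solver is replaced by a brute-force model search (the hypothetical helper `sat_prove`), a Yes answer returns the whole assumption set instead of a smaller subset, and the two tiny inputs at the end are our own illustrations rather than benchmark problems. Atoms are strings, and the inputs are assumed not to use $\bot$.

```python
from itertools import combinations

# Flat clause: (body, head), read as "the conjunction of body implies head".
# Implication clause: (a, b, c), read as (a -> b) -> c.


def sat_prove(clauses, atoms, assumptions, goal):
    """Brute-force stand-in for satProve(s, A, q): look for a model of the
    clauses that contains the assumptions and does not contain the goal."""
    free = sorted(atoms - assumptions)
    for size in range(len(free) + 1):
        for extra in combinations(free, size):
            model = frozenset(assumptions) | frozenset(extra)
            if goal not in model and all(
                head in model or not body <= model for body, head in clauses
            ):
                return "No", model
    return "Yes", frozenset(assumptions)  # A' = A: no attempt to shrink it


def prove(flat, impl, goal):
    atoms = {goal}
    for body, head in flat:
        atoms |= body | {head}
    for lam in impl:
        atoms |= set(lam)
    solver = list(flat)  # plays the role of the solver state s

    def pr_aux(X, A, q):
        answer, data = sat_prove(solver, atoms, A, q)       # line 7
        if answer == "Yes":
            return answer, data                             # line 8
        M = data
        for lam in X:                                       # line 10
            a, b, c = lam
            if a in M or b in M or c in M:
                continue
            rest = [other for other in X if other != lam]
            answer1, A1 = pr_aux(rest, M | {a}, b)          # line 11
            if answer1 == "Yes":
                solver.append((frozenset(A1 - {a}), c))     # line 13
                return pr_aux(X, A, q)                      # line 14
        return "No", M                                      # line 15

    answer, _ = pr_aux(list(impl), frozenset(), goal)
    return "Valid" if answer == "Yes" else "CountSat"


fs = frozenset
# (a -> b) -> c together with a -> b, b -> c, c -> g: the goal g follows.
valid = prove([(fs({"a"}), "b"), (fs({"b"}), "c"), (fs({"c"}), "g")],
              [("a", "b", "c")], "g")
# Peirce's formula ((a -> b) -> a) -> a: the goal a is not derivable.
peirce = prove([(fs({"b"}), "a")], [("a", "b", "a")], "a")
print(valid, peirce)
```

On the first input the recursive call answers Yes, the unit clause c is learned, and the repeated top-level call succeeds; on the second input no clause is learned and the answer is CountSat.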


We proceed as in Ex. 1. The clausification procedure yields $(R_0, X, \tilde{g})$, where $X$
consists of the implication clauses $\lambda_0, \dots, \lambda_8$ in Fig. 5 and the set $R_0$ contains
the 24 flat clauses below:


Po > J, pı > P2, Pi > P13, P2 > Pı, P2 > Ps, P3 > P7, P3 > Dii, Pa > Pio, Pa > Pia, 
Po > Ps, P3 > p3, P3 + pa, Pa > p2, Pa > P3, Ps > pi, Ps > Pa, Pe > Ps, Po > Ps 
Pi \ P2 > Do, P7 \ Ps > Pe, Pio A Pii > Po, Pis A Pia > Piz, Piz > Ps, Y> G- 


The execution of proveR($R_0$, $X$, $\tilde{g}$) (see Fig. 5) requires 14 calls to the SAT-
solver and 4 restarts. After the last call we get $W = \{w_7, w_8, w_9\}$ and $w_k \triangleright_W X$
for every $w_k \in W$, thus the computation ends yielding CountSat. The model
$K(W)$, depicted at the bottom left of the figure, is a countermodel for $R_0, X \Rightarrow \tilde{g}$
and for $\psi$ (see Sec. 2).
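
What "being a countermodel" means can be checked by evaluating intuitionistic forcing directly. The sketch below is a generic illustration, not code from intuitR: it assumes that $K(W)$ is the Kripke model whose worlds are the sets in $W$, ordered by inclusion, where each world forces exactly the atoms it contains. Since the worlds of Fig. 5 are not reproduced here, we use a two-world model refuting Peirce's formula instead; the tuple encoding of formulas is our own.

```python
def forces(w, worlds, f):
    """Intuitionistic forcing of formula f at world w (worlds ordered by inclusion).

    Formulas are tuples: ("atom", name), ("bot",), ("and", f, g),
    ("or", f, g), ("imp", f, g).
    """
    kind = f[0]
    if kind == "atom":
        return f[1] in w
    if kind == "bot":
        return False
    if kind == "and":
        return forces(w, worlds, f[1]) and forces(w, worlds, f[2])
    if kind == "or":
        return forces(w, worlds, f[1]) or forces(w, worlds, f[2])
    if kind == "imp":
        # An implication must hold at every world above w (w itself included).
        return all(
            not forces(v, worlds, f[1]) or forces(v, worlds, f[2])
            for v in worlds
            if w <= v
        )
    raise ValueError(f"unknown connective: {kind}")


a, b = ("atom", "a"), ("atom", "b")
peirce = ("imp", ("imp", ("imp", a, b), a), a)   # ((a -> b) -> a) -> a

root = frozenset()
top = frozenset({"a"})
W = [root, top]

print(forces(root, W, peirce))  # is the formula forced at the root?
print(forces(top, W, peirce))
```

The formula is forced at the upper world but not at the root, so this model is a countermodel for it; the same check applies to any finite set of worlds closed under the assumptions stated above.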


5 Related Work and Experimental Results 


We compare the procedure proveR of intuitR with its intuit counterpart,
namely the procedure prove defined in Fig. 6. Here we comply with the
presentation in [10], equivalent to the original one in [2]. The recursive auxiliary
function prAux plays the role of the main loop of proveR (but in proveR the set
of atoms A is not used); the loop inside prAux corresponds to the inner loop of
proveR.⁴ We point out some major differences. Firstly, in prAux the
interpretations M computed by the SAT-solver are not collected; in the loop, only the
interpretation M returned by the current call to satProve is considered, thus at the beginning of each


4 Actually intuit implements a variant of prAux where as many clauses
$\varphi$ as possible are added to the solver.
Req ai Rı,b—> c, X, A >b Ro, 9, X, (a> b)>c>q ni

R, Xq "° Ra, Ra, X, (a >b) >c q l 
Rı, Xı Fi vy y, Ro, X2 >q r ACV,q4EVU{L} 
Ri, Ro, X, X2 >q us p = NA\{a}) >c 


Fig. 7. The calculus $\mathrm{LJT}_{\mathrm{SAT}}$.


iteration just the “local” conditions of the test $M \triangleright_W \lambda$ are checked (line 10).
Secondly, the call satProve(s, w ∪ {a}, b) to the SAT-solver performed by proveR is
replaced by the recursive call prAux(X \ {λ}, M ∪ {a}, b) at line 11; as a
consequence, we cannot build derivations by applying rule $cpl_1$. As thoroughly
discussed in [10], the calculus underlying intuit is the sequent calculus $\mathrm{LJT}_{\mathrm{SAT}}$
in Fig. 7, obtained from $C^{\to}$ by replacing the rule $cpl_1$ with the more general
rule ljt and introducing a cut rule. Rule ljt can be seen as a generalization of
Dyckhoff’s implication-left rule from the calculus LJT (alias G4ip) [3,14]. We
remark that a $C^{\to}$-derivation is isomorphic to a cut-free $\mathrm{LJT}_{\mathrm{SAT}}$-derivation where,
in every application of rule ljt, the left premise has a trivial proof (just apply
rule $cpl_0$). It has also been shown how countermodels and $\mathrm{LJT}_{\mathrm{SAT}}$-derivations can
be extracted from prove computations. In brief, countermodels are obtained by
considering some of the interpretations coming from No(_) answers; these
countermodels are in general bigger than the ones built by proveR, where at each restart
the model is emptied. As an example, let $\sigma_0 = R_0, X \Rightarrow \tilde{g}$ be defined as in
Ex. 2; the computation of prove($R_0$, $X$, $\tilde{g}$) requires 31 calls to the SAT-solver
(24 No(_) answers) and the computed countermodel for $\sigma_0$ has 6 worlds (see
Fig. 5); instead, proveR($R_0$, $X$, $\tilde{g}$) requires 14 calls and the countermodel has 3
worlds. Derivation extraction presents some awkward aspects. The key insight
is that, for every recursive call prAux(X, A, q) occurring in the computation of
prove(R, X, g), if prAux(X, A, q) returns Yes(A') (where A' ⊆ A), then we can
build an $\mathrm{LJT}_{\mathrm{SAT}}$-derivation of a sequent $R, R', A, X \Rightarrow q$, where $R'$ contains some
of the clauses added to the SAT-solver. The derivation is built either by applying
the rule $cpl_0$ if prAux ends at line 8, or else by applying rule ljt, exploiting the
derivations obtained by the recursive calls at lines 11 and 14. Accordingly, the
main call prove(R, X, g) yields a derivation of $R, R', X \Rightarrow g$. The crucial point
is that the redundant clauses $\varphi$ in $R'$ satisfy $R, X \vdash_i \varphi$ (this ultimately follows
from property (a) in Sec. 3), thus we can eliminate them by applying the cut rule.


Example 3. Let $\sigma_0 = R_0, X \Rightarrow \tilde{g}$ be defined as in Ex. 1; prove($R_0$, $X$, $\tilde{g}$) yields
the $\mathrm{LJT}_{\mathrm{SAT}}$-derivation $D_0$ of $R_2, \varphi_4, X \Rightarrow \tilde{g}$ in Fig. 8. By applying the cut rule
three times, we get an $\mathrm{LJT}_{\mathrm{SAT}}$-derivation of $\sigma_0$. We stress that the $C^{\to}$-derivation
of $\sigma_0$ obtained with intuitR (see Fig. 4) has a simpler structure.


Finally, we remark that the clauses $\varphi$ computed in prAux do not enjoy
property (P2) (Sec. 3); we have observed cases where such clauses are even
duplicated (e.g., with formulas from class SYJ205 of the ILTP library).
[Derivation tree $D_0$. Its classical leaves are $R_0, p_2, \tilde{p}_8 \vdash_c p_1$; $R_1, p_1, \tilde{p}_{10} \vdash_c p_2$;
$R_2, p_3 \vdash_c p_2$; $R_3, p_3 \vdash_c p_1$; $R_4, p_1, \tilde{p}_5 \vdash_c p_3$; $R_5, p_2 \vdash_c p_3$; $R_6 \vdash_c \tilde{g}$.
The root is $R_2, \varphi_4, X \Rightarrow \tilde{g}$.]

$\lambda_0 = (p_3 \to p_2) \to \tilde{p}_7$    $\lambda_2 = (p_2 \to p_3) \to \tilde{p}_8$    $\lambda_3 = (p_2 \to p_1) \to \tilde{p}_6$
$\lambda_4 = (p_1 \to p_3) \to \tilde{p}_{10}$    $\lambda_5 = (p_1 \to p_2) \to \tilde{p}_5$

$\varphi_0 = \tilde{p}_8 \to \tilde{p}_6$    $\varphi_1 = \tilde{p}_{10} \to \tilde{p}_5$    $\varphi_2 = \tilde{p}_7$    $\varphi_3 = \tilde{p}_9$    $\varphi_4 = \tilde{p}_5 \to \tilde{p}_{10}$    $\varphi_5 = \tilde{p}_8$
$X_T = X \setminus \{\lambda_k \mid k \in T\}$    $R_{k+1} = R_k \cup \{\varphi_k\}$


Fig. 8. Derivation $D_0$ of $R_2, \varphi_4, X \Rightarrow \tilde{g}$ in $\mathrm{LJT}_{\mathrm{SAT}}$ (see Ex. 3).


Experimental results. We have implemented intuitR in Haskell on top of
intuit: we have replaced the function prove with proveR and added some
features (e.g., trace of computations, construction of derivations/countermodels);
as in intuit, we exploit the module MiniSat, a Haskell bundle of the MiniSat
SAT-solver [4] (but in principle we can use any incremental SAT-solver). We
compare intuitR with intuit and with two state-of-the-art provers for
IPL by replicating the experiments in [2]. The first prover is fCube [5]; it is based
on a standard tableaux calculus and exploits a variety of simplification rules [6]
that can significantly reduce branching and backtracking. The second prover is
intHistGC [11]; it relies on a sequent calculus with histories and uses
dependency-directed backtracking for global caching to restrict the search space; we run it
with its best flags (-b -c -c3). All tests were conducted on a machine with
an Intel i7-8700 CPU@3.20GHz and 16GB memory. We considered the
benchmarks provided with the intuit implementation, including the ILTP library, the
intHistGC benchmarks and the API problems introduced by the intuit developers.
This amounts to a total of 1200 problems, 498 Valid and 702 CountSat; we used
a timeout of 600 seconds. Fig. 9 reports the most significant results, including
the classes where at least one prover fails and the classes where intuitR performs
poorly. In all the tests, the time required by clausification is negligible. Even
though no optimized data structure has been implemented, intuitR solves more
problems than its competitors; in families SYJ201 (Valid formulas) and SYJ207
(CountSat formulas) intuitR outperforms its rivals; in all the other cases, except
the families EC, negEC and portia, intuitR is comparable to the best prover
(which is intuit in most cases). The most remarkable improvement with respect
to intuit occurs with class SYJ212 (see Fig. 10), where intuit timings are
fluctuating. To give a close comparison, let us consider the case k = 25; clausification
produces 246 flat clauses and 100 implication clauses (176 atoms). Our intuit
implementation requires 11214 calls to the SAT-solver (10181 No(_)) and the
computed countermodel has 1955 worlds. Instead, intuitR requires 45 calls to
the SAT-solver, 8 restarts and yields a countermodel consisting of 4 worlds; the
set W contains 26 worlds before the first restart, one world before the remaining
ones. With all the benchmarks the models generated during the computation are
small (typically, big models occur before the first restart); however, differently
from [7,8,9], we cannot guarantee that countermodels have minimum depth or
minimum number of worlds. To complete the picture, the scatter plot in Fig. 11
compares intuitR and intuit on all the benchmarks.


Class (number of problems) | intuitR       | intuit         | fCube           | intHistGC
SYJ201 (50)                | 50 (2.259)    | 50 (11.494)    | 50 (259.776)    | 50 (39.466)
SYJ202 (38)                | 10* (49.265)  | 10* (50.658)   | 9* (176.984)    | 6* (324.673)
SYJ203 (50)                | 50 (0.250)    | 50 (0.335)     | 50 (1.671)      | 50 (0.293)
SYJ204 (50)                | 50 (0.442)    | 50 (0.477)     | 50 (0.972)      | 50 (0.203)
SYJ205 (50)                | 50 (0.500)    | 50 (0.730)     | 50 (1.317)      | 50 (4.129)
SYJ206 (50)                | 50 (0.303)    | 50 (0.348)     | 50 (0.759)      | 50 (0.112)
SYJ207 (50)                | 50 (2.291)    | 50 (109.919)   | 50 (138.546)    | 50 (1014.476)
SYJ208 (38)                | 38 (5.225)    | 38 (5.479)     | 29* (2.755)     | 38 (497.715)
SYJ209 (50)                | 50 (0.226)    | 50 (0.278)     | 50 (1.690)      | 50 (0.254)
SYJ210 (50)                | 50 (0.272)    | 50 (0.252)     | 50 (0.988)      | 50 (0.288)
SYJ211 (50)                | 50 (0.462)    | 50 (1.251)     | 50 (1.073)      | 50 (63.686)
SYJ212 (50)                | 50 (0.669)    | 42* (587.794)  | 50 (2.698)      | 50 (1.624)
EC (100)                   | 100 (2.738)   | 100 (0.821)    | 100 (6.183)     | 100 (0.651)
negEC (100)                | 100 (3.614)   | 100 (1.116)    | 100 (13.733)    | 100 (5.807)
cross (4)                  | 4 (0.100)     | 4 (0.097)      | 4 (3.417)       | 2* (0.005)
jm_cross (4)               | 4 (0.120)     | 4 (0.090)      | 4 (5.404)       | 3* (4.324)
jm_lift (3)                | 3 (0.170)     | 3 (0.133)      | 3 (6.847)       | 2* (0.028)
lift (3)                   | 3 (0.119)     | 3 (0.102)      | 3 (6.494)       | 2* (0.012)
mapf (4)                   | 4 (0.187)     | 4 (0.400)      | 4 (446.921)     | 3* (0.043)
portia (100)               | 100 (32.878)  | 100 (22.596)   | 100 (3255.818)  | 100 (3200.135)
negportia (100)            | 100 (7.956)   | 100 (8.309)    | 98* (3826.011)  | 100 (28.289)
negportiav2 (100)          | 100 (8.081)   | 100 (8.411)    | 98* (1264.103)  | 100 (3212.293)
nishimura2 (28)            | 28 (9.784)    | 28 (12.285)    | 27* (141.326)   | 28 (7.616)
Unsolved                   | 28            | 36             | 43              | 38


Fig. 9. For each prover, we report the number of problems solved within the 600s timeout
and, between brackets, the total time in seconds required for the solved problems. The
best prover is highlighted; a star indicates that there are some unsolved problems.
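
The "Unsolved" row of Fig. 9 can be cross-checked against the starred entries, since only partially solved classes contribute to it. The short Python sketch below does this bookkeeping. The dictionaries are our own transcription of the starred cells; the fCube cell for SYJ208 is deliberately left out, so that the residual shows how many failures must be attributed to classes not listed here (under the assumption that SYJ208 is the only such class, this is its number of unsolved problems).

```python
# Class sizes for the classes that have at least one starred entry.
SIZES = {
    "SYJ202": 38, "SYJ212": 50, "cross": 4, "jm_cross": 4, "jm_lift": 3,
    "lift": 3, "mapf": 4, "negportia": 100, "negportiav2": 100, "nishimura2": 28,
}

# Solved counts of the starred (partially solved) classes, per prover.
STARRED = {
    "intuitR": {"SYJ202": 10},
    "intuit": {"SYJ202": 10, "SYJ212": 42},
    "fCube": {"SYJ202": 9, "negportia": 98, "negportiav2": 98, "nishimura2": 27},
    "intHistGC": {"SYJ202": 6, "cross": 2, "jm_cross": 3, "jm_lift": 2,
                  "lift": 2, "mapf": 3},
}

# The "Unsolved" row as reported in Fig. 9.
REPORTED = {"intuitR": 28, "intuit": 36, "fCube": 43, "intHistGC": 38}


def unsolved(prover: str) -> int:
    """Unsolved problems accounted for by the starred entries listed above."""
    return sum(SIZES[cls] - solved for cls, solved in STARRED[prover].items())


residual = {p: REPORTED[p] - unsolved(p) for p in REPORTED}
for prover in REPORTED:
    print(prover, unsolved(prover), residual[prover])
```

The residual is 0 for intuitR, intuit and intHistGC, and 9 for fCube, which is consistent with 29 solved problems out of 38 in one further class.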
k      | intuitR | intuit  | k  | intuitR     | intuit | k      | intuitR     | intuit
1..24  | < 0.01  | < 0.1   | 31 | 0.007       | 8.724  | 39     | 0.020       | 0.404
25     | 0.007   | 0.691   | 32 | 0.007       | 4.216  | 40     | 0.016       | 0.838
26     | 0.007   | 25.064  | 33 | 0.012       | 0.034  | 41     | 0.027       | -
27     | 0.007   | 0.020   | 34 | 0.010       | 2.445  | 42     | 0.020       | 0.785
28     | 0.008   | 0.083   | 35 | 0.033       | 77.226 | 43     | 0.036       | 435.324
29     | 0.009   | 8.412   | 36 | 0.018       | 0.038  | 44     | 0.026       | 0.098
30     | 0.008   | -       | 37 | 0.016       | 22.445 | 45     | 0.070       | 0.639
       |         |         | 38 | [illegible] | -      | 46..50 | [illegible] | -

Problem k: $(\dots((\neg\neg p_1 \leftrightarrow p_2) \leftrightarrow p_3) \leftrightarrow \dots \leftrightarrow p_k) \leftrightarrow (\dots((p_1 \leftrightarrow p_2) \leftrightarrow p_3) \dots p_k)$


Fig. 10. Timings for problems k = 1..50 of SYJ212 (CountSat); - means timeout (600s).

Fig. 11. Comparison between intuitR and intuit (1172 problems; the 28 problems
where both provers run out of time have been omitted); time axes are logarithmic, and the
8 red squares indicate that intuit has exceeded the timeout.


To conclude, we point out that intuitR can be extended to deal with some
superintuitionistic logics [1]. For instance, let us consider the Gödel-Dummett
logic GL, characterized by linear models; at any step of the computation of
proveR, the model $K(W)$ must be kept linear. Whenever the insertion of a new
world to $W$ breaks linearity, we follow a “restart with learning” strategy [12]: let
$\psi = (a \to b) \lor (b \to a)$ be the instance of the GL-axiom falsified at the root of
$K(W)$; we restart by taking $\psi$ as “learned axiom”, so as to avoid the repetition of
the flaw. However, we cannot add $\psi$ to the SAT-solver, because $\psi$ is not a clause;
instead we add the clausification of $\psi$, namely the clauses $\tilde{q}_1 \lor \tilde{q}_2$, $\tilde{q}_1 \land a \to b$, $\tilde{q}_2 \land b \to a$,
where $\tilde{q}_1$ and $\tilde{q}_2$ are fresh atoms; although the language of the SAT-solver must
be extended, the process converges. The other generalizations suggested in [2]
(modal logics, fragments of first-order logic) seem to be more challenging.
Abstract. Attestation logics have been used for specifying systems with
policies involving different principals. Cyberlogic is an attestation logic 
used for the specification of Evidential Transactions (ETs). In such trans- 
actions, evidence has to be provided supporting its validity with respect 
to given policies. For example, visa applicants may be required to demon- 
strate that they have sufficient funds to visit a foreign country. Such ev- 
idence can be expressed as a Cyberlogic proof, possibly combined with 
non-logical data (e.g., a digitally signed document). A key issue is how 
to construct and communicate such evidence/proofs. It turns out that
attestation modalities make it challenging to use established proof-theoretic
methods such as focusing. Our first contribution is the refinement of Cyberlogic
proof theory with knowledge operators which can be used to
represent knowledge bases local to one or more principals. Our second 
contribution is the identification of an executable fragment of Cyberlogic, 
called Cyberlogic programs, enabling the specification of ETs. Our third 
contribution is a sound and complete proof system for Cyberlogic pro- 
grams enabling proof search similar to search in logic programming. Our 
final contribution is a proof certificate format for Cyberlogic programs 
inspired by Foundational Proof Certificates as a means to communicate 
evidence and check its validity. 


Keywords: Attestation Logics - Proof Search - Sequent Calculus 


1 Introduction 


Attestation logics have been used for the specification of policies
of distributed systems, such as access control systems [1], distributed
authorization policies [14,21], and evidential transactions (ETs) [I5]5/6J6[29]. In these
logics, one specifies policies involving attestation formulas of the form $K :> F$,
where $K$ is a principal (or agent) in the system.

Cyberlogic is an attestation logic for ETs. In Cyberlogic, cryptographic keys
$K$ are identified with specific authorities, and attestations $K :> A$ express the
fact that principal $K$ attests to statement $A$. For example, $K$ may be a
visa-granting authority and $A$ the statement that the visa requester is authorized
to enter the specified country by the end of the year and at most once. An
evidential transaction might issue a visa given that proof of sufficient funds has
been provided in the form of a digital certificate whose validity can then be
verified by customs authorities upon entry.

Formally, evidence in ETs can be expressed as a Cyberlogic proof. To carry 
out an ET, a Cyberlogic proof demonstrating policy compliance shall be pro- 
duced and communicated. ETs therefore enable trust in, for example, distributed 
exchanges in electronic commerce, by enabling the exchange of various forms of 
verifiable evidence, such as evidence of funds in the visa example above. 

The problem of producing attestation logic proofs (and proof objects) has
not been given enough attention so far. Attestation logics have been formalized
as Hilbert-style proof systems [1,15] that do not have the sub-formula property
and therefore are not suitable for proof search. Other works on authorization
logics have proposed sequent calculi which do possess the sub-formula
property. However, the search space is too large to enable efficient proof search.

The established proof-theoretic method for proof search is focusing [3,18].
Focusing distinguishes between inference rules that have “don’t know” and “don’t
care” non-determinism to prune the proof search space. Interestingly, focused
proof systems [7,18] provide a proof-theoretical justification for backward and
forward chaining, two proof-search strategies for Horn clauses (logic programs).
Such justification, however, breaks down when programs contain modalities, such as
attestation modalities, i.e., formulas of the form $K :> F$. This is because focusing is
lost whenever any of these formulas is encountered and therefore the reduction
of the search space due to focusing is not so significant for attestation logics.

Our main goal is the study of Cyberlogic’s proof theory in order to enable 
proof search (similar to the search involved in logic programming) and the gen- 
eration of proof certificates for the communication of evidence in ETs. 

Our first contribution, detailed in Section 2, is a Gentzen-style proof system
for Cyberlogic that admits cut elimination. A feature of the proof system is that
it enables the combination of evidence represented as logical derivations as well as
digital evidence, e.g., signed hashes of documents, financial statements, medical
records. The logic also includes a knowledge operator for sets of principals.

Our second contribution, detailed in Section 3, is the identification of a
fragment of Cyberlogic, called Cyberlogic programs, akin to Horn clauses used in
logic programming. This is motivated by the ongoing work on building
distributed logic programming engines for ETs which extend existing engines
with attestations of the form $K :> A$.

Our third contribution, also detailed in Section 3, addresses the challenge of
how to efficiently construct Cyberlogic program proofs. We propose a
focusing-inspired proof system for Cyberlogic programs and prove that it is sound and
complete for this fragment. This system enables more efficient proof search.

Our last contribution, detailed in Section 4, addresses the challenge of how to
efficiently communicate evidence. We propose a proof certificate format for
Cyberlogic programs inspired by Foundational Proof Certificates (FPCs) [9]. FPCs
enable the reconstruction of proofs by using simple logic programs as guides. This
means that such certificates can elide parts that can be easily reconstructed or
which one is willing to reconstruct.


Fig. 1. $CL_{\mathcal{K}}$: Cyberlogic proof system for $\mathcal{K} = \{K_1, \dots, K_n\}$. Here $A$ is an atomic
formula, $Q \subseteq \mathcal{K}$, and $\Gamma|_Q = \{kb_{Q'} F \mid kb_{Q'} F \in \Gamma \land Q' \subseteq Q\}$. Moreover, in rules $\exists_l$
and $\forall_r$, $a$ is a fresh constant not appearing in $\Gamma$ nor $F$.


2 Cyberlogic Proof Theory 


Cyberlogic is an intuitionistic modal logic which can be used for specifying
ETs. The logic is parametrized by a finite set of principals $\mathcal{K} = \{K_1, \dots, K_n\}$,
which are used in formulas as follows:


— $K_i :> F$: meaning that principal $K_i$ attests the (Cyberlogic) formula $F$;

— $kb_Q F$, where $Q \subseteq \mathcal{K}$: meaning that all principals in $Q$ know $F$, or,
alternatively, that the combined knowledge of the principals in $Q$ implies $F$; and

— $evidence_{K_i} A$: standing for external evidence signed by principal $K_i$.


External evidence is left unspecified since it falls outside the logical scope
and depends on the ET being formalized. For example, $evidence_{K_i} A$ could be
signed hashes of tickets, financial statements, medical records, etc. In Cyberlogic
the evidence associated with an ET is a combination of a formal proof (in sequent
calculus) and a collection of external evidence.

Cyberlogic formulas are constructed according to the following grammar:


$F, G ::= A \mid F \land G \mid F \lor G \mid F \supset G \mid \top \mid \bot \mid K :> F \mid kb_Q F \mid \forall x.F \mid \exists x.F$


where $A$ is an atom, $K \in \mathcal{K}$, and $Q \subseteq \mathcal{K}$. The formula $K :> F$ is read as “principal
$K$ attests $F$” and acts like the says modality in lax logics [13,27]. The formula
$kb_Q F$ is read as “principals in $Q$ know $F$” and is inspired by the knows modality
used in linear authorization logics [14,21]. Different from that logic, Cyberlogic
allows the direct specification of knowledge shared by multiple principals, as
illustrated in Example 1.
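
The grammar maps directly onto an algebraic data type. The following Python sketch is only an illustration of that reading; the class and field names are our own and are not part of Cyberlogic or of any existing implementation. The helper `principals` collects every principal mentioned by an attestation $K :> F$ or a knowledge operator $kb_Q F$.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Const:              # True encodes top, False encodes bottom
    value: bool


@dataclass(frozen=True)
class Bin:                # op is one of "and", "or", "imp"
    op: str
    left: object
    right: object


@dataclass(frozen=True)
class Says:               # K :> F
    principal: str
    body: object


@dataclass(frozen=True)
class Kb:                 # kb_Q F
    group: FrozenSet[str]
    body: object


@dataclass(frozen=True)
class Quant:              # kind is "forall" or "exists"
    kind: str
    var: str
    body: object


def principals(f) -> Set[str]:
    """Principals occurring in the modalities of a formula."""
    if isinstance(f, (Atom, Const)):
        return set()
    if isinstance(f, Bin):
        return principals(f.left) | principals(f.right)
    if isinstance(f, Says):
        return {f.principal} | principals(f.body)
    if isinstance(f, Kb):
        return set(f.group) | principals(f.body)
    if isinstance(f, Quant):
        return principals(f.body)
    raise TypeError(f"not a Cyberlogic formula: {f!r}")


# kb_{K1,K2} funds  implies  K :> visa   (atom names are hypothetical)
example = Bin("imp", Kb(frozenset({"K1", "K2"}), Atom("funds")),
              Says("K", Atom("visa")))
print(sorted(principals(example)))
```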

Cyberlogic sequents are of the shape $\Gamma \longrightarrow G$, where $\Gamma$ is a multiset of
formulas. The Cyberlogic proof system, $CL_{\mathcal{K}}$, is depicted in Figure 1. Rules for
the intuitionistic connectives $\land, \lor, \supset, \forall, \exists$ are as in LJ [30]. The new rules are
the ones involving assertions $K :> F$ and $kb_Q$. Note that a “built-in” contraction
of the main formula is needed on the left premise of $\supset_l$ and the premise of $\forall_l$, as
expected in intuitionistic logics. Also, the rule $kb_l$ has an explicit contraction on
the premise. These contractions are needed for cut admissibility (Theorem 2).

Rules $:>_l$ and $:>_r$ specify that $:>$ is a lax modality [27,21,24]. The intuition
behind $:>_l$ is: if an assertion $G$ of a principal $K$ is provable using $F$, then it
is also provable if $K$ attests $F$. Rule $:>_r$ specifies that principals are rational,
i.e., they can always attest formulas that are derivable. Differently from existing
systems with lax modalities, CL has the rule ext. This rule allows a proof of
an attestation $K :> A$ to be completed whenever a principal provides evidence
$evidence_K A$ for the claim $A$. This formalizes the intuition that principals may use
digital evidence signed by their private key. We leave the definition of evidence
unspecified as it depends on the intended ET specified.

Rules $kb_l$ and $kb_r$ refine Cyberlogic by enabling the collection of logical
theories known by a set of principals. Such theories act as knowledge bases. Rule $kb_l$
specifies that any common knowledge can be part of a knowledge base. The
interesting rule is $kb_r$, which specifies that $kb_Q F$ can only be proved using the local
knowledge or evidence provided by principals in $Q$. This is formally captured by
restricting $\Gamma$ in $kb_r$’s premise to the set $\Gamma|_Q = \{kb_{Q'} F \mid kb_{Q'} F \in \Gamma \land Q' \subseteq Q\}$.
This is a powerful construct that increases the expressiveness of Cyberlogic. In
particular, it is straightforward to specify that certain assertions can be
concluded from the shared knowledge of a set of principals.
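
The context restriction used by $kb_r$ is a simple filter. As a hedged illustration (the tuple encoding of formulas and the function name `restrict` are our own choices), the following Python sketch computes $\Gamma|_Q$ for a small context:

```python
def restrict(gamma, Q):
    """Return Gamma|_Q: the formulas kb_{Q'} F of gamma with Q' a subset of Q.

    A formula kb_{Q'} F is encoded as ("kb", frozenset(Q'), F); every other
    formula (atoms, attestations, ...) is dropped by the restriction.
    """
    Q = frozenset(Q)
    return [f for f in gamma if f[0] == "kb" and f[1] <= Q]


gamma = [
    ("kb", frozenset({"K1"}), ("atom", "A")),
    ("kb", frozenset({"K2"}), ("atom", "B")),
    ("kb", frozenset(), ("atom", "C")),          # kb with the empty set
    ("says", "K1", ("atom", "D")),               # K1 :> D
    ("atom", "E"),
]

print(len(restrict(gamma, {"K1"})))        # kb_{K1} A and kb_{} C survive
print(len(restrict(gamma, {"K1", "K2"})))  # all three kb-formulas survive
print(len(restrict(gamma, set())))         # only kb_{} C survives
```

Formulas of the form $kb_\emptyset F$ survive every restriction, because the empty set is included in every $Q$.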


Proposition 1. The following sequents are provable in $CL_{\mathcal{K}}$ for all $K \in \mathcal{K}$ and
formulas $F_1, F_2$. $F_1 \equiv F_2$ represents the sequents $(F_1 \longrightarrow F_2)$ and $(F_2 \longrightarrow F_1)$:


1. F>KDF 8. kbo(Fi A F2) = kbo Fi A kbo F> 
2. kbok — F 9. (K:> Fi VK > Fo) > K > (F; V F2) 
3. kbgg F — K> F 10. kbo AV kboB — kbo(A V B) 
4 E = ied 11; K:>(F, D Fh) > (K> Fi DKD Fd) 
5. kbo F — kboF, if Q' C Q. In par- 

ticular, kbo,kbo, F — kbouo, F. 12 kbo(Fi D Fe) — (kbo Fi D K > Fo) 
6. kbo, F A kbo, F — kboiug, F 13. K:>(Vx.F) = Va.K > F, V € {v,3} 
7 K:>(Fi A Fo) = K> F AK D Fo 14. kbo(Va.F) = Vz.kboF, V € {V, 3} 


Moreover, the following sequents are not provable if $K_1 \neq K_2$ and $Q_1 \neq Q_2$:


1 KoF- F 6. kboruo, F kbo, F, i € {1,2} 
2. F -A kbo F 7. kbo,u0,F + kbo, F A kbo, F 
4. Kı (Ko > F) => Ko (Ki > F) 8. kboK :> A -> K :> kbo A 

5. kbo, (kbo, F) 4 kbo, (kbo, F) 9. K: kbo A 4 kboK :> A 


In the remainder of the paper, we elide the set of principals K whenever it 
can be deduced from the context. 
Example 1. (Shared Knowledge) The ability to use kb with multiple
principals allows the derivation of facts that depend on the combination of knowledge
of multiple principals. Suppose that principal $K_1$ knows $A$ and $B \supset C$, and
principal $K_2$ knows $A \supset B$; then the following sequent is provable in CL:


$kb_{\{K_1\}} A,\ kb_{\{K_1\}} (B \supset C),\ kb_{\{K_2\}} (A \supset B) \longrightarrow kb_{\{K_1, K_2\}} C$
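
For this particular example the derivation has a simple shape: apply $kb_r$ (which restricts the context to the knowledge of the chosen principals), open every remaining $kb$-formula on the left, and chain the resulting atomic implications. The Python sketch below mimics just that special case; it is not a prover for CL. The encoding of formulas and the function `derives_kb` are hypothetical, and only atoms and implications between atoms inside $kb$ are handled.

```python
def restrict(gamma, Q):
    """Gamma|_Q: keep the formulas kb_{Q'} F with Q' a subset of Q."""
    Q = frozenset(Q)
    return [f for f in gamma if f[0] == "kb" and f[1] <= Q]


def derives_kb(gamma, Q, goal_atom):
    """Check kb_Q goal_atom for Horn-shaped knowledge by forward chaining."""
    facts, rules = set(), []
    for _, _, body in restrict(gamma, Q):
        if body[0] == "atom":
            facts.add(body[1])
        elif body[0] == "imp" and body[1][0] == "atom" and body[2][0] == "atom":
            rules.append((body[1][1], body[2][1]))
    changed = True
    while changed:                      # saturate under the atomic implications
        changed = False
        for pre, post in rules:
            if pre in facts and post not in facts:
                facts.add(post)
                changed = True
    return goal_atom in facts


A, B, C = ("atom", "A"), ("atom", "B"), ("atom", "C")
K1, K2 = frozenset({"K1"}), frozenset({"K2"})
gamma = [("kb", K1, A), ("kb", K1, ("imp", B, C)), ("kb", K2, ("imp", A, B))]

print(derives_kb(gamma, {"K1", "K2"}, "C"))  # combined knowledge of K1 and K2
print(derives_kb(gamma, {"K1"}, "C"))        # K1 alone lacks A -> B
```

With both principals the chain A, A ⊃ B, B ⊃ C reaches C; restricting to a single principal drops one of the links and the check fails.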


Remark 1. The original Cyberlogic paper [5] (and technical report [4]) proposed
two kinds of attestations, :> and >, to distinguish whether an attestation is derived
from digital evidence or from logical inferences. This combination, however, does not
yield a proof system with the cut-elimination property [28].


The meta-theory of CL has been analysed using the L-framework [25], which
uses rewriting logic to automatically derive structural proofs of sequent calculi
properties [26]. The following lemma was used in the proofs of cut-elimination
and invertibility.


Lemma 1. If $\Gamma, K :> F \longrightarrow G$, then $\Gamma, F \longrightarrow G$.


The proof proceeds by structural induction on the derivation of $\Gamma, K :> F \longrightarrow G$.
The proof has been mechanically checked using the L-framework, with
a few cases proved by hand.

As expected, $\supset_r, \land_r, \land_l, \lor_l, \forall_r, \exists_l$ are invertible whereas $\lor_r, \supset_l, \forall_l, \exists_r$ are not
invertible. In addition, the rules $:>_l$ and $kb_l$ are invertible whereas $:>_r$ and
$kb_r$ are not invertible.


Lemma 2. If $\Gamma, K :> F \longrightarrow K :> G$ then $\Gamma, F \longrightarrow K :> G$.


This is a simple corollary of Lemma 1. Invertibility of $kb_l$ is straightforward
because of the contraction of the main formula.
Rules $:>_r$ and $kb_r$ are not invertible. The counterexamples are:


[$:>_r$] $K :> a \longrightarrow K :> a$ but $K :> a \not\longrightarrow a$
[$kb_r$] $a, a \supset kb_K b \longrightarrow kb_K b$ but $\not\longrightarrow b$


Weakening is height-preserving admissible in CL.


Theorem 1 (Identity expansion). $F \longrightarrow F$ is provable in CL for any
cyberlogic formula $F$.


The proof is by structural induction on $F$.

Theorem 2 (Cut elimination). If $\Gamma \longrightarrow F$ and $\Gamma, F \longrightarrow C$, then $\Gamma \longrightarrow C$.


The proof proceeds by a nested induction on the structure of the proofs of
$\Gamma \longrightarrow F$ and $\Gamma, F \longrightarrow C$, and the formula $F$. The noteworthy cases are the
ones where cut needs to permute over kb rules. For $kb_l$, contraction of the main
formula is needed, and the permutation over $kb_r$ can be done only if cut is
principal on the left (which is a lemma that can be proved). Details about these
transformations are in Appendix A.
3 Cyberlogic Programs


Cyberlogic programs are a fragment of CL which resembles Horn clauses in logic
programming. Section 3.2 proposes a proof search operational semantics for
cyberlogic programs and proves its soundness and completeness. The proof search
discipline relies on ideas from focusing [3]. Focused proof systems for LJ
provide a proof-theoretical justification of forward and backward chaining search.
Each technique is enforced by the choice of polarity of atomic formulas: positive
atoms lead to forward chaining and negative atoms lead to backward chaining.
This correspondence, however, does not extend to cyberlogic due to
attestation formulas $K :> A$, which cause focusing to be lost [21]. Consider the following
example where the formula under focus is in brackets:


$K_1 :> a \longrightarrow [K_1 :> a]$        $K_1 :> a, [K_2 :> b] \longrightarrow K_2 :> b$
--------------------------------------------------------------------------  $\supset_l$
$K_1 :> a, [K_1 :> a \supset K_2 :> b] \longrightarrow K_2 :> b$


In focused proof systems, forward chaining can be enforced by disallowing focus
to be lost on the right formula in the left premise, i.e. $[K_1 :> a]$. However, if $:>_r$ is
applied to this sequent the premise would be $K_1 :> a \longrightarrow a$, which is not provable
(see Proposition 1). In fact, $[K_1 :> a]$ must lose focus on the right for the proof
to be completed. Therefore, if $:>$ modalities are used in logic programs, other
strategies for proof search need to be analysed.


3.1 Cyberlogic Program Syntax 


Cyberlogic programs can be divided into goals, knowledge bases, common
knowledge, and attestation clauses.


Goals (G): Cyberlogic programs are used to derive a goal $G$, defined as:


$G ::= \top \mid K :> kb_Q A \mid G_1 \land G_2 \mid \exists x.G$


where $A$ is an atomic formula. The restriction of $:> kb_Q$ to atoms does not reduce
the expressiveness of goals, given the equivalences in Proposition 1.


Knowledge Bases (B): A knowledge base, written $kb_{\{K_i\}} \Gamma$, of a principal $K_i \in \mathcal{K}$
is a set of formulas $\Gamma$ not containing the connectives $:>$ or $kb$. Here, $kb_{\{K_i\}} \Gamma$
represents the set of formulas $\{kb_{\{K_i\}} F \mid F \in \Gamma\}$.

Intuitively, a knowledge base $kb_{\{K_i\}} \Gamma$ can be interpreted as $K_i$’s local
knowledge. This means that $K_i$ may use its own prover to derive new facts. For
example, if $\Gamma$ is a collection of Horn clauses, then $K_i$ may deploy a Prolog engine
to derive some goal. Alternatively, if $\Gamma$ is a set of formulas in CNF, then
$K_i$ may use resolution provers. The absence of modal connectives in knowledge
bases has an important impact on the design of the proof certificates described in
Section 4, as those may rely on existing certificates for different provers [9].
Common Knowledge (C): Common knowledge consists of knowledge bases that are
known to all principals, written as $kb_\emptyset \Gamma$. Since $\emptyset \subseteq Q$ for every $Q$, these formulas
remain in the context when applying $kb_r$. In this sense they contain first-order
formulas that may be used by all principals.


Attestation Formulas (D): Formulas of the form $K :> kb_Q A$ are derived by
attestation formulas of the forms below, where for all $1 \le i \le n$, $K_i \in \mathcal{K}$, $Q_i \subseteq \mathcal{K}$,
$A_1, \dots, A_n, A$ are atomic formulas, and the variables $X$ are bound by universal quantifiers:


$\forall X.(kb_{Q_1}(K_1 :> A_1) \land \dots \land kb_{Q_n}(K_n :> A_n) \land G \supset K :> (kb_\emptyset A))$
$\forall X.(kb_{Q_1}(K_1 :> A_1) \land \dots \land kb_{Q_n}(K_n :> A_n) \land G \supset K :> (kb_{\{K\}} A))$


Intuitively, an attestation formula belongs to a principal, namely K in the 
right-hand side of D. Such formulas derive K’s attestation of an atomic formula 
which is its own knowledge (kb,,}A), or common knowledge (kbg A). This means 
that K’s attestation formulas cannot derive knowledge belonging to other prin- 
cipals. Furthermore to derive an attestation, one can use the knowledge base 
of other principals, i.e. the formulas kbg,(K; ‘> A;) or additional goals, i.e. G. 
Finally notice that K :>(kbgA) and K :>(kb,«}A) are attestation formulas them- 
selves, where the left-hand side of D is empty (denoting T). 

The difference between formulas $K :> A$ and $K :> (kb_{\{K\}} A)$ is subtle. Note that
the former can be derived using the evidence rule ext, while the latter cannot.
$K :> (kb_{\{K\}} A)$ is $K$'s attestation that $A$ follows from its local knowledge base.
It is possible to specify that $A$ can be derived from external evidence, but
this has to be made explicit by an attestation formula, e.g., $kb_{\{K\}}(K :> A) \supset
K :> (kb_{\{K\}} A)$. Note that this formula is not a tautology.


We are interested in proving goals from attestation formulas, knowledge 
bases, and common knowledge, which are formally represented by cyberlogic 
program sequents defined as follows. 


Definition 1 (Cyberlogic Program Sequents (CPS)). A cyberlogic program
sequent (CPS) is a sequent $C, B, D \longrightarrow G$, where $B$ is a set of knowledge
bases, $C$ is a set of common knowledge formulas, $D$ is a set of attestation
formulas, and $G$ is a goal formula.


Example 2. (Local Computations) This example illustrates the use of kb to
specify when parts of a derivation can be proved locally using a principal's
knowledge. Consider that the following clause


$$kb_{\{K_1\}}(K_1 :> F_1) \land kb_{\{K_2\}}(K_2 :> F_2) \supset K :> kb_{\{K\}} G$$


specifies that for $K$ to attest $G$, $K_1$ and $K_2$ have to attest $F_1$ and $F_2$ respectively,
using their own local theories, common knowledge, or evidence. This means that
computations carried out by $K_1$ and $K_2$ to derive their assertions $K_1 :> F_1$ and
$K_2 :> F_2$ respectively do not depend on other principals, and therefore the search
for these derivations can be performed locally.

Example 3. (Levels of Trust) This example illustrates the use of kb to specify
that some evidence should only be trusted if derived from trusted sources.
Consider three principals $\mathcal{K} = \{K_T, K_U, K\}$ where $K$ trusts evidence from $K_T$, but
not all evidence from $K_U$. Then the following clause


$$kb_{\{K, K_T\}}(K :> critical(ok)) \land kb_{\mathcal{K}}(K :> nonCritical(ok)) \supset K :> kb_\emptyset(all(ok))$$


specifies that $K$ can attest that everything is ok as common knowledge if all
the non-critical and critical elements are ok. However, the check of critical parts
can only be performed by principals $K$ trusts, namely $K$ itself or $K_T$. Information
from $K_U$'s knowledge bases cannot be used in the proof of critical(ok).


Example 4. (Simplified Visa) Consider a visa issuing scenario where an appli- 
cant applies to a consulate (cons) for an entry visa. This is an example of an 
ET as, to obtain the visa, evidence has to be provided that, for example, the 
applicant has no crime records, or that they have sufficient funds. We illustrate 
how such an ET can be specified in Cyberlogic. 

The formula below labelled main specifies conditions for a visa to be issued: 


main: $\forall Id.\forall Doc.\forall V.(kb_{\{cons\}}(cons :> visitOk(Id, Doc))$
$\quad \land\ kb_{\{cons\}}(cons :> prepVisa(Id, V))$
$\quad \land\ cons :> kb_{\{cons\}}(sufFin(Doc)) \land police :> kb_{\{police\}}(noCrimeRec(Id))$
$\quad \supset cons :> kb_{\{cons\}}(issVisa(Id, Doc, V)))$


The transaction for cons issuing a visa V to an applicant Id requires cons to attest 
validity of Id’s visit by itself (visitOk(Id, Doc)) and Id’s criminal record with the 
help of the police (noCrimeRec(Id)). In addition, cons also needs to attest Id’s 
financial status (sufFin(Doc)). 

The following two clauses expand on how cons can attest sufFin(Doc): either 
via an employment contract or a bank statement. 


cont: $kb_{\{cons\}}(\forall Doc.\forall Cont.(empContract(Doc, Cont) \land valid(Cont)$
$\quad \supset sufFin(Doc)))$
bankStmt: $\forall Doc.\forall Stmt.(kb_{\{cons\}}(cons :> bankStmt(Doc, Stmt))$
$\quad \land\ bank :> kb_{\{bank\}}(valid(Stmt)) \supset cons :> kb_{\{cons\}}(sufFin(Doc)))$

The formula labeled cont belongs to cons’s knowledge base. This means that cons 
can check the validity of an employment contract without evidence from other 
principals. For example, valid(Cont) may check the contract duration and salary. 
The formula labeled bankStmt, on the other hand, takes the bank statement 
Stmt from the given documents, Doc, and requires the bank to validate it using 
its knowledge base. This makes sense as Id's financial records are sensitive and
do not need to be disclosed to anyone other than their financial institution.

These clauses also illustrate the subtle difference between goal formulas
$K :> kb_{\{K\}} F$ and knowledge base formulas $kb_{\{K\}}(K :> F)$. For example, in the
main clause, the fact that the applicant has come to their appointment at the
consulate does not depend on other agents, and that is why we use a knowledge base
formula. The same applies to the visa preparation. On the other hand, the fact
that the applicant has sufficient funds may require evidence from other parties, e.g.,
the applicant's bank. Therefore this is specified as a goal.
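
The visa example can be made concrete with a small executable sketch. The code below is an illustrative assumption, not the calculus of this paper: it works on ground instances only (Id = alice, Doc = doc, V = visa), ignores the scope sets of kb, and stands in for each principal's local prover or external evidence with a plain set of facts. The names `Attests`, `Clause`, `prove`, `FACTS` and `CLAUSES` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attests:
    """Ground goal 'principal :> kb_{principal}(atom)' (scope sets are ignored here)."""
    principal: str
    atom: str


@dataclass(frozen=True)
class Clause:
    """Ground attestation clause: local premises and goals imply the head."""
    label: str
    local: tuple   # (principal, atom) pairs checked against that principal's facts
    goals: tuple   # Attests goals that may require other clauses
    head: Attests


def prove(goal, clauses, facts, seen=frozenset()):
    """Naive backward chaining over ground clauses (assumed simplification)."""
    # The principal can derive the atom from its own knowledge base.
    if goal.atom in facts.get(goal.principal, set()):
        return True
    # Guard against cyclic clause sets.
    if goal in seen:
        return False
    seen = seen | {goal}
    for clause in clauses:
        if clause.head != goal:
            continue
        local_ok = all(a in facts.get(p, set()) for p, a in clause.local)
        if local_ok and all(prove(g, clauses, facts, seen) for g in clause.goals):
            return True
    return False


# Hypothetical local knowledge of each principal.
FACTS = {
    "cons": {"visitOk(alice,doc)", "prepVisa(alice,visa)", "bankStmt(doc,stmt)"},
    "bank": {"valid(stmt)"},
    "police": {"noCrimeRec(alice)"},
}

# Ground instances of the clauses labelled main and bankStmt.
CLAUSES = [
    Clause(
        "main",
        local=(("cons", "visitOk(alice,doc)"), ("cons", "prepVisa(alice,visa)")),
        goals=(Attests("cons", "sufFin(doc)"), Attests("police", "noCrimeRec(alice)")),
        head=Attests("cons", "issVisa(alice,doc,visa)"),
    ),
    Clause(
        "bankStmt",
        local=(("cons", "bankStmt(doc,stmt)"),),
        goals=(Attests("bank", "valid(stmt)"),),
        head=Attests("cons", "sufFin(doc)"),
    ),
]

VISA = Attests("cons", "issVisa(alice,doc,visa)")
print(prove(VISA, CLAUSES, FACTS))
```

In this sketch the visa goal succeeds only because the bank's own fact set contains valid(stmt); removing it makes the sufFin goal, and hence the visa goal, fail. This mirrors the point that the bank statement is validated by the bank and not by the consulate.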
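
Goal decomposition

$$\dfrac{}{\Theta; \Delta; \Lambda \longrightarrow [\top]}\;\top_r \qquad \dfrac{\Theta; \Delta; \Lambda \longrightarrow [G_1] \quad \Theta; \Delta; \Lambda \longrightarrow [G_2]}{\Theta; \Delta; \Lambda \longrightarrow [G_1 \land G_2]}\;\land_r \qquad \dfrac{\Theta; \Delta; \Lambda \longrightarrow [G[t/x]]}{\Theta; \Delta; \Lambda \longrightarrow [\exists x.G]}\;\exists_r$$

$$\dfrac{\Theta; \Delta; [\Lambda] \longrightarrow K :> kb_Q A}{\Theta; \Delta; \Lambda \longrightarrow [K :> kb_Q A]}\;\Rightarrow_l \qquad \dfrac{\Theta|_Q^{*} \longrightarrow A}{\Theta; \Delta; \Lambda \longrightarrow [K :> kb_Q A]}\;:>_r + kb_r + kb_l$$

$:>_l$ application

$$\dfrac{\Theta, kb_Q A; \Delta; [\Lambda] \longrightarrow K' :> kb_{Q'} A'}{\Theta; \Delta; [\Lambda, K :> kb_Q A] \longrightarrow K' :> kb_{Q'} A'}\;:>_l$$

$$\dfrac{\Theta; [\Delta]; \Lambda^{\dagger} \longrightarrow K :> kb_Q A}{\Theta; \Delta; [\Lambda^{\dagger}] \longrightarrow K :> kb_Q A}\;\Rightarrow_{att} \qquad \dfrac{\Theta; \Delta; \Lambda^{\dagger} \longrightarrow [K :> kb_Q A]}{\Theta; \Delta; [\Lambda^{\dagger}] \longrightarrow K :> kb_Q A}\;\Rightarrow_G$$

Attestation formula decomposition

$$\dfrac{\begin{array}{c}\Theta; \Delta; \Lambda \longrightarrow [G\sigma] \qquad \Theta; \Delta; [\Lambda, K :> kb_Q A\sigma] \longrightarrow K' :> kb_{Q'} A' \\ \Theta|_{Q_1}; \cdot; \cdot \longrightarrow [K_1 :> A_1\sigma] \quad \cdots \quad \Theta|_{Q_n}; \cdot; \cdot \longrightarrow [K_n :> A_n\sigma]\end{array}}{\Theta; [\Delta, \forall X.(kb_{Q_1}(K_1 :> A_1) \land \cdots \land kb_{Q_n}(K_n :> A_n) \land G \supset K :> kb_Q A)]; \Lambda \longrightarrow K' :> kb_{Q'} A'}\;att$$

$K :> A$ decomposition

$$\dfrac{evidence_K\, A}{\Theta; \cdot; \cdot \longrightarrow [K :> A]}\;ext \qquad \dfrac{\Theta^{*} \longrightarrow A}{\Theta; \cdot; \cdot \longrightarrow [K :> A]}\;:>_r + kb_l$$

First-order reasoning:

All first-order rules from CL on $\Theta^{*} \longrightarrow A$ sequents

Fig. 2. CLp: sequent calculus for cyberlogic programs. $A$, $A'$ and $A_i$ are atoms, $\Lambda^{\dagger}$
is such that for all $K' :> kb_Q A' \in \Lambda^{\dagger}$, $K' \neq K$, and $\Theta^{*} = \{F \mid kb_Q F \in \Theta\}$.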


3.2 CPS Proof Search 


Proof search of CPS can be divided into the following phases: goal decomposition,
$:>_l$ application, attestation formula decomposition, $K :> A$ decomposition, and
first-order reasoning. We define a (focusing-inspired) sequent calculus for the
CPS fragment, called CLp (Figure 2), to enforce this proof search discipline.
Sequents in CLp have the following shape: $\Theta; \Delta; \Lambda \longrightarrow F$, where $\Theta$ contains
kb formulas, $\Delta$ contains attestation formulas, $\Lambda$ contains formulas of the form
$K :> kb_Q A$, and $F$ is either a goal formula, $kb_Q(K :> A)$, $K :> A$ or $A$, where $A$ is
an atom. Moreover, the part of the sequent containing the formula that is being
decomposed is enclosed in square brackets. This helps distinguish the
phases mentioned above.


Lemma 3. The $kb_r$ rule permutes down every left rule in the CPS fragment.


Proof. First we note that, in the CPS fragment, $\land$, $\lor$, $\forall$, and kb formulas on the
left do not have kb modalities as subformulas. We look at the case of $kb_l$, as the
others follow a similar argument.

Since $F$ is not a kb formula, $F \notin (\Gamma, kb_{Q'} F, F)|_Q$. Therefore we can
conclude that $(\Gamma, kb_{Q'} F, F)|_Q = (\Gamma, kb_{Q'} F)|_Q$ and the permutation is:


$$\dfrac{\dfrac{\dfrac{\varphi}{(\Gamma, kb_{Q'} F, F)|_Q \longrightarrow G}}{\Gamma, kb_{Q'} F, F \longrightarrow kb_Q G}\;kb_r}{\Gamma, kb_{Q'} F \longrightarrow kb_Q G}\;kb_l \quad\rightsquigarrow\quad \dfrac{\dfrac{\varphi}{(\Gamma, kb_{Q'} F)|_Q \longrightarrow G}}{\Gamma, kb_{Q'} F \longrightarrow kb_Q G}\;kb_r$$


The case for $:>_l$ holds vacuously, as it is impossible to have $:>_l$ immediately
below $kb_r$, since the former requires the right formula to be of the shape $K :> F$.

The remaining case is $\supset_l$. Observe that in the CPS fragment, the formula
$F_2$ in $F_1 \supset F_2$ is of the form $K :> kb_Q A$. Therefore, $(\Gamma, F_2)|_Q = \Gamma|_Q$. Also,
$(\Gamma, F_1 \supset F_2)|_Q = \Gamma|_Q$. Thus the permutation is:


$$\dfrac{\Gamma \longrightarrow F_1 \qquad \dfrac{\dfrac{\varphi}{(\Gamma, F_2)|_Q \longrightarrow G}}{\Gamma, F_2 \longrightarrow kb_Q G}\;kb_r}{\Gamma, F_1 \supset F_2 \longrightarrow kb_Q G}\;\supset_l \quad\rightsquigarrow\quad \dfrac{\dfrac{\varphi}{(\Gamma, F_1 \supset F_2)|_Q \longrightarrow G}}{\Gamma, F_1 \supset F_2 \longrightarrow kb_Q G}\;kb_r$$


Notice that it is crucial for attestation formulas to have a $:>$ modality formula
in the consequent, otherwise Lemma 3 would not hold. As seen below, this lemma
is key to proving completeness of the proof search procedure for CPS.
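
The restriction $\Gamma|_Q$ that drives the permutation argument can be illustrated with a short sketch. It assumes, as the equalities used in the proofs suggest, that $\Gamma|_Q$ keeps exactly the formulas $kb_{Q'} F$ with $Q' \subseteq Q$ and drops everything else; the representation (`Kb`, `restrict`, `star`, and formulas as strings) is hypothetical and only meant to make the set computations explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kb:
    """A formula kb_Q F: F is known within the set of principals Q."""
    scope: frozenset
    body: str


def restrict(context, q):
    """Gamma|_Q (assumed reading): keep only kb_{Q'} F with Q' a subset of Q."""
    q = frozenset(q)
    return [f for f in context if isinstance(f, Kb) and f.scope <= q]


def star(context):
    """Theta* = {F | kb_Q F in Theta}: strip the kb modality."""
    return [f.body for f in context if isinstance(f, Kb)]


# Hypothetical context: common knowledge, two local facts, one attestation.
CTX = [
    Kb(frozenset(), "p"),
    Kb(frozenset({"cons"}), "q"),
    Kb(frozenset({"bank"}), "r"),
    "cons :> kb_{cons} s",  # not a kb formula, so it never survives restriction
]

print(restrict(CTX, {"cons"}))
print(star(restrict(CTX, {"cons"})))
```

Under this reading, common knowledge (empty scope) survives every restriction, while a formula of the shape $K :> kb_Q A$ is always dropped, which is the fact used in the $\supset_l$ case above.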


Theorem 3 (Soundness and completeness of CLp). $\Theta; \Delta; \Lambda \longrightarrow [F]$ in
CLp if and only if $\Theta, \Delta, \Lambda \longrightarrow F$ in CL.


Proof. Soundness is straightforward: a proof in CLp can be transformed into a
proof in CL by using the same logical rules (possibly expanded, e.g. att becomes
a sequence of $\forall_l + \supset_l + \land_r + kb_r$) and skipping the phase transition rules $\Rightarrow$ (which
only change the syntax of the sequent, but not its content).

Completeness is achieved by reasoning about invertibility and permutability 
of inference rules in the specific case of CPS. We argue that each phase can be 
performed in the proposed order. 


Goal decomposition The goal formula can be eagerly decomposed until
becoming $K :> kb_Q A$ before applying other rules because $\top_r$ and $\land_r$ are invertible
and, in the absence of $\forall_r$ and $\exists_l$, $\exists_r$ permutes down every rule. Once the
right side formula is $K :> kb_Q A$, there are two options to continue: (1) change to the
$:>_l$ application phase, or (2) apply the rules $:>_r + kb_r + kb_l$ in Figure 1.

The first case is discussed below. In the second case, we need to argue that
$kb_r$ may be applied immediately above $:>_r$. Once $:>_r$ is applied, we could choose
a formula from the context to continue with. However, $kb_r$ permutes down all
left rules for the CPS fragment, as shown in Lemma 3. Therefore any proof that
continues with a formula in $\Theta$, $\Delta$, or $\Lambda$ above $:>_r$ can be transformed into a
proof where $kb_r$ is applied immediately above $:>_r$. Since $kb_l$ is invertible, it can
be applied to exhaustion safely.


$:>_l$ application After eagerly decomposing the goal, $:>_l$ can be applied to
exhaustion since it is an invertible rule (Lemma 2).

Attestation formula decomposition This phase contains only one rule,
namely att, which encompasses $\forall_l$, $\supset_l$, $\land_r$, and $kb_r$. The quantifier rule can
always be delayed until its subformula is needed, and $\land_r$ is an invertible rule,
therefore these can be chained together without loss of completeness. Due to
Lemma 3, the application of $kb_r$ can be permuted down for the CPS fragment
and thus it is safe to apply the rule as soon as possible.

The two top premises of att force the proof search to go back to applying
invertible rules, which does not break completeness.


$K :> A$ decomposition Once this state is reached, $\Theta$ is left with kb formulas
whose subformulas are in first-order logic (i.e., no modalities). In this case, one
can either close the proof with external evidence, or apply $:>_r + kb_l$ to release
the atom on the right side. The eager application of $kb_l$ is justified by its
invertibility. It can also be delayed until this point because it permutes up $\supset_l$
and $:>_r$ in CL, and it permutes up $kb_r$ in the CPS fragment (Lemma 3).


First-order reasoning From this point onwards, there are no modalities in 
the sequent so it will be proved using only first-order reasoning. 


4 Proof Certificates 


Cyberlogic programs may be used to derive facts about attestation (goals), us- 
ing pure logical reasoning (knowledge bases), principal delegation (attestation 
formulas), and external evidence. Once a goal is derived, evidence shall be avail- 
able so that any interested party can verify that the proof is correct. Verifiable 
evidence means that entities do not need to trust each other’s proof producing 
process, as long as they can check the proofs using their own trusted processes. 

Given a cyberlogic program sequent of the shape $\Theta; \Delta; \Lambda \longrightarrow G$, one could
take its full sequent calculus proof in CLp as evidence. If the interested parties 
know the calculus, checking validity of proofs reduces to checking the valid ap- 
plication of each rule. However, these proofs are too fine grained, and contain 
many uninteresting details that can be easily inferred. Proof certificates elide 
such details, and keep only the crucial steps for proof reconstruction. 

Proof certificates for cyberlogic are inspired by λ-terms and foundational
proof certificates [8,20] (FPC). FPC is a framework for checking proofs
in different formalisms using a small trusted kernel. The proposed kernels are
the focused sequent calculi LKF and LJF [18] for LK and LJ respectively,
augmented with predicates for guiding proof search [9]. The definition of
proof certificates for a proof system S relies on two parts: (1) a translation of
S's formulas into LKF or LJF formulas; and (2) a correspondence of S proofs
(or proof steps) to LKF or LJF proof steps. Given these two elements, a proof
certificate for a proof of F in S consists of a predicate which guides a proof of
F's translation in LKF or LJF. The following proof formats can be checked in
FPC: resolution, λ-terms, Horn clauses, Frege proofs, matings, tableaux, etc.

Defining LKF or LJF FPCs for cyberlogic is challenging due to the modalities
:> and kb, and digital evidence. LKF has been used to check proofs in modal
logics [19], but the translation of modal formulas into LK formulas used the
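
$$\dfrac{}{top : \Theta; \Delta; \Lambda \longrightarrow [\top]}\;\top_r \qquad \dfrac{\Xi : \Theta; \Delta; \Lambda \longrightarrow [G[t/x]]}{\Xi : \Theta; \Delta; \Lambda \longrightarrow [\exists x.G]}\;\exists_r$$

$$\dfrac{\Xi_1 : \Theta; \Delta; \Lambda \longrightarrow [G_1] \qquad \Xi_2 : \Theta; \Delta; \Lambda \longrightarrow [G_2]}{split(\Xi_1, \Xi_2) : \Theta; \Delta; \Lambda \longrightarrow [G_1 \land G_2]}\;\land_r$$

$$\dfrac{\Xi : \Theta; \Delta; [\Lambda] \longrightarrow K :> kb_Q A}{toSays_l(\Xi) : \Theta; \Delta; \Lambda \longrightarrow [K :> kb_Q A]}\;\Rightarrow_l \qquad \dfrac{\Psi : \Theta|_Q^{*} \longrightarrow A}{fol(\Psi) : \Theta; \Delta; \Lambda \longrightarrow [K :> kb_Q A]}\;:>_r + kb_r + kb_l$$

$$\dfrac{\Xi : \Theta, kb_Q A; \Delta; [\Lambda] \longrightarrow K' :> kb_{Q'} A'}{\Xi : \Theta; \Delta; [\Lambda, K :> kb_Q A] \longrightarrow K' :> kb_{Q'} A'}\;:>_l$$

$$\dfrac{\Xi : \Theta; [\Delta]; \Lambda^{\dagger} \longrightarrow K :> kb_Q A}{toAtt(\Xi) : \Theta; \Delta; [\Lambda^{\dagger}] \longrightarrow K :> kb_Q A}\;\Rightarrow_{att} \qquad \dfrac{\Xi : \Theta; \Delta; \Lambda^{\dagger} \longrightarrow [K :> kb_Q A]}{toGoal(\Xi) : \Theta; \Delta; [\Lambda^{\dagger}] \longrightarrow K :> kb_Q A}\;\Rightarrow_G$$

$$\dfrac{\begin{array}{c}\Xi' : \Theta; \Delta; \Lambda \longrightarrow [G\sigma] \qquad \Xi'' : \Theta; \Delta; [\Lambda, K :> kb_Q A\sigma] \longrightarrow K' :> kb_{Q'} A' \\ \Xi_1 : \Theta|_{Q_1}; \cdot; \cdot \longrightarrow [K_1 :> A_1\sigma] \quad \cdots \quad \Xi_n : \Theta|_{Q_n}; \cdot; \cdot \longrightarrow [K_n :> A_n\sigma]\end{array}}{att(i, \sigma, (\Xi_1, \ldots, \Xi_n), \Xi', \Xi'') : \Theta; [\Delta, i : \forall X.(kb_{Q_1}(K_1 :> A_1) \land \cdots \land kb_{Q_n}(K_n :> A_n) \land G \supset K :> kb_Q A)]; \Lambda \longrightarrow K' :> kb_{Q'} A'}\;att$$

$$\dfrac{evidence(E, A)}{ext(E) : \Theta; \cdot; \cdot \longrightarrow [K :> A]}\;ext \qquad \dfrac{\Psi : \Theta^{*} \longrightarrow A}{fol(\Psi) : \Theta; \cdot; \cdot \longrightarrow [K :> A]}\;:>_r + kb_l$$

Fig. 3. CLp kernel for verifying CLp proof certificates of Cyberlogic programs.
$\Lambda^{\dagger}$ is such that for all $K' :> kb_Q A' \in \Lambda^{\dagger}$, $K' \neq K$, and $\Theta^{*} = \{F \mid kb_Q F \in \Theta\}$.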


modalities' semantic definition. Instead, we propose a modular CLp kernel which
allows facts derived from knowledge bases or external evidence to be checked by
the appropriate engine or entity.

The CLp kernel (Figure 3) is constructed by augmenting sequents with a
certificate $\Xi$ (a term indicating how the proof must proceed) and indices for the
formulas in $\Delta$. A certificate for a proof of $\Theta; \Delta; \Lambda \longrightarrow G$ is $\Xi : \Theta; \Delta_I; \Lambda \longrightarrow G$,
where $\Xi$ is a term built from the predicates used in the kernel, and $\Delta_I$ is a mapping
from indices to formulas in $\Delta$. The indices are used in $\Xi$. The checking of a
cyberlogic sequent $\Theta; \Delta; \Lambda \longrightarrow G$ with certificate $\Xi$ starts from the sequent
$\Xi : \Theta; \Delta_I; \Lambda \longrightarrow [G]$. Certificates denoted by the letter $\Psi$ can represent proofs
in other formalisms and may be checked by another engine. The predicates in $\Xi$
are used for the following purposes during a derivation in the kernel.

First of all, they indicate how the proof should continue when there are multiple
choices. For example, if the sequent is of the form $\Theta; \Delta; \Lambda \longrightarrow [K :> kb_Q A]$,
then $\Xi$ must be one of toSays_l(_) or fol(_), indicating whether to work on :>
modalities on the left, or finish the proof with first-order reasoning, respectively.

Secondly, certificates relay information at the appropriate moment. For example,
split(_, _) contains the certificates for each of the branches of a splitting
rule, and ext(_) includes external evidence for proposition $A$. Note that there
is no certificate for $\exists_r$ since these can be instantiated with meta-variables, and
unification can be verified when the proof is completed.

The certificate for rule att is more interesting. It includes the index $i$ of the
attestation formula to be decomposed, the substitution $\sigma$ for the $\forall$ quantifier,
and certificates for each premise. Note that each of $\Xi_1, \ldots, \Xi_n$ must be ext(_) or fol(_).
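
The shape of these certificate terms can be sketched as a small algebraic data type. The code below is an illustrative assumption rather than the kernel itself: it only models the certificate constructors named above (top, split, toSays_l, toAtt, toGoal, fol, ext, att) and checks the stated side condition that the certificates for the premises $K_i :> A_i\sigma$ of att are ext(_) or fol(_). It does not check a certificate against a sequent. All Python names, and the placeholder strings used as first-order certificates, are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Top:
    pass


@dataclass(frozen=True)
class Split:
    left: Any
    right: Any


@dataclass(frozen=True)
class ToSaysL:
    sub: Any


@dataclass(frozen=True)
class ToAtt:
    sub: Any


@dataclass(frozen=True)
class ToGoal:
    sub: Any


@dataclass(frozen=True)
class Fol:
    psi: Any  # opaque first-order certificate, checked by another engine


@dataclass(frozen=True)
class Ext:
    evidence: Any  # pointer to external (digital) evidence


@dataclass(frozen=True)
class Att:
    index: str          # index i of the attestation formula
    subst: dict         # substitution sigma for the quantified variables
    premises: tuple     # certificates for the premises K_i :> A_i sigma
    goal_cert: Any      # certificate for the goal premise
    context_cert: Any   # certificate for the premise with the extended context


def well_formed(cert):
    """Check only the term shape, including the ext/fol restriction on att."""
    if isinstance(cert, (Top, Fol, Ext)):
        return True
    if isinstance(cert, Split):
        return well_formed(cert.left) and well_formed(cert.right)
    if isinstance(cert, (ToSaysL, ToAtt, ToGoal)):
        return well_formed(cert.sub)
    if isinstance(cert, Att):
        leaves_ok = all(isinstance(p, (Ext, Fol)) for p in cert.premises)
        return (leaves_ok
                and well_formed(cert.goal_cert)
                and well_formed(cert.context_cert))
    return False


# Hypothetical certificate in the style of the visa example.
EXAMPLE = Att(
    "main",
    {"Id": "alice", "Doc": "doc", "V": "visa"},
    (Fol("psi_visitOk"), Fol("psi_prepVisa")),
    Split(Fol("psi_fin"), Fol("psi_crime")),  # placeholder sub-certificates
    ToGoal(Fol("id")),
)

print(well_formed(EXAMPLE))
```

Keeping `Fol` and `Ext` opaque reflects the modular design described above: the kernel only needs to know which engine or entity is responsible for checking the leaf, not how that check is carried out.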

Example 5. Consider Example 4 and let the indices of the formulas be their
labels: main, cont, and bankStmt. The certificate for a proof that alice can get
a visa is $\Xi : cont; main, bankStmt; \cdot \longrightarrow cons :> kb_{\{cons\}}\, issVisa(alice, doc, visa)$,
where $\Xi$ is:


$$att(main, \{Id \mapsto alice, Doc \mapsto doc, V \mapsto visa\}, [fol(\Psi_{visitOk}), fol(\Psi_{prepVisa})], \Xi_G, \Xi_0)$$


The certificates $\Psi_{visitOk}$ and $\Psi_{prepVisa}$ are first-order logic proof certificates from
derivations using the consulate's own knowledge base.

Certificate $\Xi_0$ corresponds to att's premise where the conclusion of main is
added to the context. This branch can be closed by removing the modalities, so
$\Xi_0 = toGoal(fol(id))$, where id is a first-order logic directive to close the proof.

Certificate $\Xi_G$ guides the proof of the new goal:


$$cons :> kb_{\{cons\}}(sufFin(doc)) \land police :> kb_{\{police\}}(noCrimeRec(alice))$$


and thus $\Xi_G = split(\Xi_{fin}, \Xi_{crime})$. $\Xi_{fin}$ depends on how cons decides to check
for sufficient funds. It could rely on the bank and use the attestation formula
bankStmt, in which case $\Xi_{fin}$ has the shape


$$toSays_l(toAtt(att(bankStmt, \_, \_, \_, \_)))$$


Or it could use cont from its knowledge base, in which case $\Xi_{fin}$ would be fol(_).


5 Related Work 


Attestation logics have been proposed for the specification of policies of several
distributed systems [14,21,15,5,29,1]. We have been inspired by some of this work
in the design of Cyberlogic. Actually, Cyberlogic was proposed some decades
ago [29,5], but until now its proof theory had not been carefully investigated. In
particular, there were no statements on cut-elimination. Additionally, we have
been inspired by the previous works on authorization logics [14,21,5] to extend
Cyberlogic with knowledge operators.

The main contribution of our work is the study of proof search and proof 
certificates for attestation logics with knowledge operators. 

In previous work in intuitionistic authorization logic, knowledge was
restricted to one principal. As demonstrated in Example 1, allowing for multiple
principal knowledge databases ensures collaboration in reasoning.

Proof search for attestation logics is not adequately addressed in the literature.
Either the proposed proof systems are Hilbert-style, which do
not enjoy the sub-formula property and are therefore not suitable for proof
search, or they are sequent calculus proof systems, but not focused proof
systems [14,21,29,5,16]. [14] only speculates that logic programming languages can
be used to carry out proof search for fragments of attestation logic. We confirm
this speculation with the definition of Cyberlogic programs.

Our main inspiration for proof certificates is the work on foundational proof
certificates [9]. However, the existing work did not consider proof certificates for
attestation logics. Closer to our objective is the work of Libal and Volpe [19],
which defines proof certificates for modal logics by encoding (the semantics of)
these logics in LKF. Our work instead proposes proof certificates directly in
Cyberlogic. This means that we are able to capitalize on rules, such as attestation
rules, to build more compact certificates. Another difference is that our proof
certificates may contain (pointers to) extra-logical evidence.

Cyberlogic has been formalized in Coq [11], encoding evidential transactions
for Schengen Visa applications. Our approach is different in that it lays a proof 
theoretic foundation to Cyberlogic. In particular, proof search is formally justi- 
fied as well as the representation of Cyberlogic proofs as FPCs. 

Logic programming engines, such as ETB [10], have been proposed for pro- 
gramming ETs. However, these engines do not (yet) support attestations, such 
as $K :> F$, local knowledge, such as $kb_Q F$, nor the use of digital certificates. We
believe that this work can greatly profit from the foundations laid by this paper. 

Finally, works propose the use of evidence for authorization. Specifi- 
cally, show that a fragment of their system is decidable in linear time. It 
would be interesting to investigate how this fragment relates to Cyberlogic pro- 
grams, and whether proof certificates as defined in this work can be applied to 
the decidable fragment. This is left for future work. 


6 Conclusions 


This paper lays the proof-theoretic foundations for Cyberlogic, an attestation
logic for evidential transactions, and refines Cyberlogic with epistemic modalities.
We identify a fragment of Cyberlogic, Cyberlogic programs, and propose a proof
system similar to focused proof systems for enabling sound and complete proof
search. The necessary permutations for completeness rely on the careful interplay
between attestation, $:>$, and knowledge modalities, $kb_Q$. We then propose a
concise proof certificate format for proofs of Cyberlogic programs.

This paper is the first step for a framework enabling evidential transactions 
that we are currently implementing. In particular, we are extending Distributed 
Datalog engines available in to support Cyberlogic. Moreover, we are in- 
tegrating such engines with PKI infrastructure, available in, for example, Dis- 
tributed Ledger Technologies. This means that evidence, both in the form of 
digital evidence and logical derivations in the form of FPCs, can be stored and 
audited through the Ledger Technologies. 

We are currently investigating extensions to Cyberlogic programs to include 
other modalities, such as temporal and epistemic while still preserving 
its good proof search properties. We have also started to study conditions for 
when two attestation rules can be introduced in any order. If two clauses can be 
introduced in any order, then they can also be introduced in parallel. Therefore, 
this would provide proof-theoretic justification for proof search optimization. 
This could be used, for example, for proposing refinements to dependency graphs 
used for evaluating distributed logic programming which take principals into 
account. These results will impact the maintenance of evidential transactions, 
whose applications can have important consequences for, e.g., certification in
automotive and avionics domains.

A Cut-elimination


Proof. (Sketch) The proof follows the usual Gentzen strategy of reducing the 
cuts’ grade and rank. The interesting cases are rank reduction over kb rules. 

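In the case of $kb_l$, contraction of the main formula is needed for the permutation
to work. If this was not the case, we could not conclude $\Gamma, A \longrightarrow G$ from
$\Gamma, kb_Q A \longrightarrow G$. The transformations are:


$$\dfrac{\dfrac{\dfrac{\varphi_1}{\Gamma, kb_Q A, A \longrightarrow C}}{\Gamma, kb_Q A \longrightarrow C}\;kb_l \qquad \dfrac{\varphi_2}{\Gamma, kb_Q A, C \longrightarrow G}}{\Gamma, kb_Q A \longrightarrow G}\;cut \quad\rightsquigarrow\quad \dfrac{\dfrac{\dfrac{\varphi_1}{\Gamma, kb_Q A, A \longrightarrow C} \qquad \dfrac{\varphi_2 + \text{weakening}}{\Gamma, kb_Q A, A, C \longrightarrow G}}{\Gamma, kb_Q A, A \longrightarrow G}\;cut}{\Gamma, kb_Q A \longrightarrow G}\;kb_l$$


$$\dfrac{\dfrac{\varphi_1}{\Gamma, kb_Q A \longrightarrow C} \qquad \dfrac{\dfrac{\varphi_2}{\Gamma, kb_Q A, A, C \longrightarrow G}}{\Gamma, kb_Q A, C \longrightarrow G}\;kb_l}{\Gamma, kb_Q A \longrightarrow G}\;cut \quad\rightsquigarrow\quad \dfrac{\dfrac{\dfrac{\varphi_1 + \text{weakening}}{\Gamma, kb_Q A, A \longrightarrow C} \qquad \dfrac{\varphi_2}{\Gamma, kb_Q A, A, C \longrightarrow G}}{\Gamma, kb_Q A, A \longrightarrow G}\;cut}{\Gamma, kb_Q A \longrightarrow G}\;kb_l$$


The other interesting case is when we need to permute a cut over a $kb_r$ rule
on the right branch:

$$\dfrac{\dfrac{\varphi_1}{\Gamma \longrightarrow C} \qquad \dfrac{\dfrac{\varphi_2}{(\Gamma, C)|_{Q_j} \longrightarrow G}}{\Gamma, C \longrightarrow kb_{Q_j} G}\;kb_r}{\Gamma \longrightarrow kb_{Q_j} G}\;cut$$


There are two cases to consider:


1. $C = kb_{Q_i} C'$ and $Q_i \subseteq Q_j$: in this case, we can permute the cut over rules on
$\varphi_1$ (left rules except $:>_l$, which is never applicable) until it is principal. This
lemma can be proved by case analysis. At this point, the premise on the left
branch will be $\Gamma|_{Q_i} \longrightarrow C'$. Then $kb_r$ can be applied to the end-sequent,
resulting in:

$$\dfrac{\dfrac{\dfrac{\varphi_1'}{\Gamma|_{Q_j} \longrightarrow kb_{Q_i} C'} \qquad \dfrac{\varphi_2'}{\Gamma|_{Q_j}, kb_{Q_i} C' \longrightarrow G}}{\Gamma|_{Q_j} \longrightarrow G}\;cut}{\Gamma \longrightarrow kb_{Q_j} G}\;kb_r$$

The proof $\varphi_2'$ is exactly $\varphi_2$, since $(\Gamma, kb_{Q_i} C')|_{Q_j} = \Gamma|_{Q_j}, kb_{Q_i} C'$ when
$Q_i \subseteq Q_j$. The proof $\varphi_1'$ is obtained from the proof of $\Gamma|_{Q_i} \longrightarrow C'$, since
$\Gamma|_{Q_i} \subseteq \Gamma|_{Q_j}$ when $Q_i \subseteq Q_j$.

2. $C \neq kb_{Q_i} C'$ or $Q_i \not\subseteq Q_j$: in this case $C \notin (\Gamma, C)|_{Q_j}$, so $kb_r$ can be applied
directly to the end-sequent, and the cut can be removed.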




References 


1. Abadi, M.: Logic in Access Control. In: 18th IEEE Symposium on Logic in Computer Science (LICS) Proceedings. pp. 228-233. IEEE Computer Society (2003). https://doi.org/10.1109/LICS.2003.1210062

2. Abadi, M., Burrows, M., Lampson, B.W., Plotkin, G.D.: A Calculus for Access Control in Distributed Systems. ACM Trans. Program. Lang. Syst. 15(4), 706-734 (1993). https://doi.org/10.1145/155183.155225

3. Andreoli, J.M.: Logic Programming with Focusing Proofs in Linear Logic. Journal of Logic and Computation 2(3), 297-347 (1992). https://doi.org/10.1093/logcom/2.3.297

4. Bernat, V.: First-Order Cyberlogic Hereditary Harrop Logic. Tech. rep., SRI International (2006), http://www.lsv.ens-cachan.fr/Publis/PAPERS/PS/

5. Bernat, V., Ruess, H., Shankar, N.: First-order Cyberlogic. Technical Report CSL-SRI-04-03, SRI International Computer Science Laboratory (2004)

6. Blass, A., Gurevich, Y., Moskal, M., Neeman, I.: Evidential Authorization. In: Nanz, S. (ed.) The Future of Software Engineering. pp. 73-99. Springer (2010). https://doi.org/10.1007/978-3-642-15187-3_5

7. Chaudhuri, K., Pfenning, F., Price, G.: A Logical Characterization of Forward and Backward Chaining in the Inverse Method. In: Furbach, U., Shankar, N. (eds.) Automated Reasoning, Third International Joint Conference, IJCAR, Proceedings. pp. 97-111. Springer Berlin Heidelberg (2006). https://doi.org/10.1007/11814771_9

8. Chihani, Z., Miller, D., Renaud, F.: Foundational Proof Certificates in First-Order Logic. In: Bonacina, M.P. (ed.) CADE-24 - 24th International Conference on Automated Deduction. Proceedings. Lecture Notes in Computer Science, vol. 7898, pp. 162-177. Springer (2013). https://doi.org/10.1007/978-3-642-38574-2_11

9. Chihani, Z., Miller, D., Renaud, F.: A Semantic Framework for Proof Evidence. J. Autom. Reasoning 59(3), 287-330 (2017). https://doi.org/10.1007/s10817-016-9380-6

10. Cruanes, S., Hamon, G., Owre, S., Shankar, N.: Tool Integration with the Evidential Tool Bus. In: Giacobazzi, R., Berdine, J., Mastroeni, I. (eds.) Verification, Model Checking, and Abstract Interpretation, 14th International Conference, VMCAI. Proceedings. pp. 275-294. Springer Berlin Heidelberg (2013). https://doi.org/10.1007/978-3-642-35873-9_18

11. Dargaye, Z., Kirchner, F., Tucci-Piergiovanni, S., Gürcan, O.: Towards Secure and Trusted-by-Design Smart Contracts. In: JFLA (2018)

12. DeYoung, H., Garg, D., Pfenning, F.: An Authorization Logic With Explicit Time. In: Proceedings of the 21st IEEE Computer Security Foundations Symposium, CSF. pp. 133-145. IEEE Computer Society (2008). https://doi.org/10.1109/CSF.2008.15

13. Fairtlough, M., Mendler, M.: Propositional Lax Logic. Inf. Comput. 137(1), 1-33 (1997). https://doi.org/10.1006/inco.1997.2627

14. Garg, D., Bauer, L., Bowers, K.D., Pfenning, F., Reiter, M.K.: A Linear Logic of Authorization and Knowledge. In: Gollmann, D., Meier, J., Sabelfeld, A. (eds.) Computer Security - ESORICS 2006, 11th European Symposium on Research in Computer Security, Proceedings. pp. 297-312. Springer Berlin Heidelberg (2006). https://doi.org/10.1007/11863908_19

15. Gurevich, Y., Neeman, I.: DKAL: Distributed-Knowledge Authorization Language. Tech. Rep. MSR-TR-2008-09, Microsoft Research (January 2008), https://www.microsoft.com/en-us/research/publication/

16. Gurevich, Y., Neeman, I.: DKAL 2 - A Simplified and Improved Authorization Language. Tech. Rep. MSR-TR-2009-11, Microsoft Research (2009), https://www.microsoft.com/en-us/research/publication/200-dkal-2-a-simplified-and-improved-authorization-language/

17. Gurevich, Y., Neeman, I.: Logic of infons: The propositional case. ACM Trans. Comput. Log. 12(2), 9:1-9:28 (2011). https://doi.org/10.1145/1877714.1877715

18. Liang, C., Miller, D.: Focusing and polarization in linear, intuitionistic, and classical logics. Theor. Comput. Sci. 410(46), 4747-4768 (2009). https://doi.org/10.1016/j.tcs.2009.07.041

19. Libal, T., Volpe, M.: A general proof certification framework for modal logic. Math. Struct. Comput. Sci. 29(8), 1344-1378 (2019). https://doi.org/10.1017/S0960129518000440

20. Miller, D.: Foundational Proof Certificates. In: Delahaye, D., Paleo, B.W. (eds.) All about Proofs, Proofs for All, Mathematical Logic and Foundations, vol. 55, pp. 150-163. College Publications (2015), inria.fr/hal-01239733

21. Nigam, V.: A framework for linear authorization logics. Theor. Comput. Sci. 536, 21-41 (2014). https://doi.org/10.1016/j.tcs.2014.02.018

22. Nigam, V., Jia, L., Loo, B.T., Scedrov, A.: Maintaining distributed logic programs incrementally. Computer Languages, Systems & Structures 38(2), 158-180 (2012). https://doi.org/10.1016/j.cl.2012.02.001

23. Nigam, V., Olarte, C., Pimentel, E.: A General Proof System for Modalities in Concurrent Constraint Programming. In: D'Argenio, P.R., Melgratti, H.C. (eds.) CONCUR 2013 - Concurrency Theory - 24th International Conference. Proceedings. Lecture Notes in Computer Science, vol. 8052, pp. 410-424. Springer (2013). https://doi.org/10.1007/978-3-642-40184-8_29

24. Nigam, V., Pimentel, E., Reis, G.: An extended framework for specifying and reasoning about proof systems. J. Log. Comput. 26(2), 539-576 (2016). https://doi.org/10.1093/logcom/exu029

25. Olarte, C.: L-framework. https://carlosolarte.github.io/L-framework/, accessed on 03-01-2021

26. Olarte, C., Pimentel, E., Rocha, C.: Proving Structural Properties of Sequent Systems in Rewriting Logic. In: Rusu, V. (ed.) Rewriting Logic and Its Applications - 12th International Workshop, WRLA 2018, Held as a Satellite Event of ETAPS, Proceedings. Lecture Notes in Computer Science, vol. 11152, pp. 115-135. Springer (2018). https://doi.org/10.1007/978-3-319-99840-4_7

27. Pfenning, F., Davies, R.: A judgmental reconstruction of modal logic. Mathematical Structures in Computer Science 11(4), 511-540 (2001). https://doi.org/10.1017/S0960129501003322

28. Reis, G.: Observations about the proof theory of cyberlogic. http://www.gisellereis.com/papers/cyberlogic-report.pdf (2019)

29. Ruess, H., Shankar, N.: Introducing Cyberlogic (2003)

30. Troelstra, A.S., Schwichtenberg, H.: Basic Proof Theory. Cambridge University Press (1996)




Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the chapter’s 
Creative Commons license, unless indicated otherwise in a credit line to the material. If 
material is not included in the chapter’s Creative Commons license and your intended 
use is not permitted by statutory regulation or exceeds the permitted use, you will need 
to obtain permission directly from the copyright holder. 




Non-clausal Redundancy Properties * 


Lee A. Barnett and Armin Biere


Johannes Kepler University Linz
Altenbergerstraße 69, 4040 Linz, Austria
{lee.barnett,armin.biere}@jku.at


Abstract. State-of-the-art refutation systems for SAT are largely based 
on the derivation of clauses meeting some redundancy criteria, ensuring 
their addition to a formula does not alter its satisfiability. However, there 
are strong propositional reasoning techniques whose inferences are not 
easily expressed in such systems. This paper extends the redundancy 
framework beyond clauses to characterize redundancy for Boolean con- 
straints in general. We show this characterization can be instantiated to 
develop efficiently checkable refutation systems using redundancy prop- 
erties for Binary Decision Diagrams (BDDs). Using a form of reverse 
unit propagation over conjunctions of BDDs, these systems capture, for 
instance, Gaussian elimination reasoning over XOR constraints encoded 
in a formula, without the need for clausal translations or extension vari- 
ables. Notably, these systems generalize those based on the strong Prop- 
agation Redundancy (PR) property, without an increase in complexity. 


1 Introduction 


The correctness and reliability of Boolean satisfiability (SAT) solvers is critical 
for many applications. For instance SAT solvers are used for verifying hardware 
and software systems (e.g. [19,28,44]), to search for solutions to open problems 
in mathematics (e.g. [38,46]), and as subroutines of other logical reasoning tools 
(e.g. [7,67]). Solvers should be able to provide solution certificates that are easily 
and externally checkable. For a satisfiable formula, any satisfying assignment is 
a suitable certificate and typically can be easily produced by a solver. For an 
unsatisfiable formula, a solver should be able to produce a refutation proof. 
Modern SAT solvers primarily refute unsatisfiable formulas using clausal 
proof systems, such as the popular DRAT system [69] used by the annual SAT 
competition in recent years [4], or newer systems based on the surprisingly strong 
Propagation Redundancy (PR) property [33]. Clausal proof systems iteratively 
extend a formula, typically given in conjunctive normal form (CNF), by adding 
clauses that are redundant; that is, their addition to the formula does not affect 
whether it is satisfiable. Systems are distinguished by their underlying redundancy
properties, restricted but efficiently-decidable forms of redundancy.

Redundancy is a useful notion in SAT as it captures most inferences made
by state-of-the-art solvers. This includes clauses implied by the current formula, 
such as the resolvent of two clauses or clauses learned during conflict-driven 
clause learning (CDCL) [8,51], as well as clauses which are not implied but de- 
rived nonetheless by certain preprocessing and inprocessing techniques [43], such 
as those based on blocked clauses [42,45,48]. Further, clausal proof systems based 
on properties like PR include short refutations for several hard families of formu- 
las, such as those encoding the pigeonhole principle, that have no polynomial- 
length refutations in resolution [2] (see [16] for an overview). These redundancy 
properties, seen as inference systems, thus potentially offer significant improve- 
ments in efficiency, as the CDCL algorithm at the core of most solvers searches 
only for refutations in resolution [9]. While the recent satisfaction-driven clause 
learning (SDCL) paradigm has shown some initial success [35,37], it is still un- 
clear how to design solving techniques which take full advantage of this potential. 

Conversely, there are existing strong reasoning techniques which similarly ex- 
ceed the abilities of CDCL alone, but are difficult to express using clausal proof 
systems. Important examples include procedures for reasoning over CNF formu- 
las encoding pseudo-Boolean and cardinality constraints (see [58]), as well as 
Gaussian elimination (see [12,61,62,68]), which has been highlighted as a chal- 
lenge for clausal proof systems [31]. Gaussian elimination, applied to sets of 
“exclusive-or” (XOR) constraints, is a crucial technique for many problems from 
cryptographic applications [62], and can efficiently solve, for example, Tseitin for- 
mulas hard for resolution [64,66]. This procedure, implemented by CryptoMin- 
iSAT [62], Lingeling [10], and Coprocessor [50] for example, can be polynomially 
simulated by extended resolution, allowing inferences over new variables, and 
similar systems (see [56,60]). However due to the difficulty of such simulations 
they are not typically implemented. Instead solvers supporting these techniques 
simply prevent them from running when proof output is required, preferring less 
efficient techniques whose inferences can be more easily represented. 

This paper extends the redundancy framework for clausal proof systems to 
include non-clausal constraints, such as XOR or cardinality constraints, pre- 
senting a characterization of redundancy for Boolean functions in general. We 
demonstrate a particular use of this characterization by instantiating it for func- 
tions represented by Binary Decision Diagrams [13], a powerful representation 
with a long history in SAT solving (e.g. [14,23,24,52,54]) and other areas of au- 
tomated reasoning (e.g. [15,29,47,57]). We show the resulting refutation systems 
succinctly express Gaussian elimination while also generalizing existing clausal 
systems. Results using a prototype implementation confirm these systems al- 
low compact and efficiently checkable refutations of CNF formulas that include 
embedded XOR constraints solvable by Gaussian elimination. 

In the rest of the paper, Section 2 includes preliminaries and Section 3 
presents the characterization of redundancy for Boolean functions. Section 4 
introduces redundancy properties for BDDs, and Section 5 demonstrates their 
use for Gaussian elimination. Section 6 presents the results of our preliminary 
implementation, and Section 7 concludes.


2 Preliminaries


We assume a set of Boolean variables $V$ under a fixed order $<$ and use standard
SAT terminology. The set of truth values is $B = \{0,1\}$. An assignment is a
function $\tau : V \to B$ and the set of assignments is $B^V$. A function $f : B^V \to B$
is Boolean. If $f(\tau) = 1$ for some $\tau \in B^V$ then $f$ is satisfiable, otherwise $f$ is
unsatisfiable. Formulas express Boolean functions as usual, are assumed to be
in conjunctive normal form, and are written using capital letters $F$ and $G$. A
clause can be represented by its set of literals and a formula by its set of clauses.

A partial assignment is a non-contradictory set of literals $\sigma$; that is, if $l \in \sigma$
then $\neg l \notin \sigma$. The application of a partial assignment $\sigma$ to a clause $C$ is written
$C|_\sigma$ and defined by: $C|_\sigma = \top$ if every $\tau \in B^V$ that satisfies $\bigwedge_{l \in \sigma} l$ also satisfies $C$,
otherwise $C|_\sigma = \{l \mid l \in C \text{ and } l, \neg l \notin \sigma\}$. For example, $(x_1 \lor x_2)|_{\{\neg x_1, x_2\}} = \top$,
and $(x_1 \lor x_2)|_{\{\neg x_2, \neg x_3\}} = (x_1)$. Similarly the application of $\sigma$ to a formula $F$
is written $F|_\sigma$ and defined by: $F|_\sigma = \top$ if $C|_\sigma = \top$ for all $C \in F$, otherwise
$F|_\sigma = \{C|_\sigma \mid C \in F \text{ and } C|_\sigma \ne \top\}$. Unit propagation is the iterated replacement
of $F$ with $F|_{\{l\}}$, for each unit clause $(l) \in F$, until $F$ includes the empty clause
$\bot$, or $F$ contains no unit clauses. A formula $F$ implies a clause $C$ by reverse unit
propagation (RUP) if unit propagation on $F \land \neg C$ ends by producing $\bot$ [27].
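
The definitions of this paragraph can be made concrete with a small Python sketch. It is an illustration only, not taken from the paper or from any solver: literals are assumed to be nonzero integers in the DIMACS style (so `-2` stands for $\neg x_2$), clauses are frozensets, and the names `restrict_clause`, `restrict_formula`, `unit_propagate`, and `implies_by_rup` are hypothetical.

```python
TOP = "TOP"  # stands for the symbol for "already satisfied"

def restrict_clause(clause, sigma):
    """C|sigma: TOP if sigma satisfies C, else C minus the literals sigma falsifies."""
    if any(l in sigma for l in clause):
        return TOP
    return frozenset(l for l in clause if -l not in sigma)

def restrict_formula(formula, sigma):
    """F|sigma: TOP if every clause is satisfied, else the set of reduced clauses."""
    reduced = {restrict_clause(c, sigma) for c in formula}
    reduced.discard(TOP)
    return reduced if reduced else TOP

def unit_propagate(formula):
    """Return "conflict" if the empty clause is produced, else the residual clause set."""
    formula = set(formula)
    while True:
        if frozenset() in formula:
            return "conflict"
        unit = next((c for c in formula if len(c) == 1), None)
        if unit is None:
            return formula
        result = restrict_formula(formula, set(unit))
        formula = set() if result == TOP else result

def implies_by_rup(formula, clause):
    """RUP check: propagate on F together with the negation of every literal of C."""
    negated = {frozenset([-l]) for l in clause}
    return unit_propagate(set(formula) | negated) == "conflict"

# (x1 or x2) restricted by {not x1, x2} and by {not x2, not x3}
print(restrict_clause(frozenset({1, 2}), {-1, 2}))
print(restrict_clause(frozenset({1, 2}), {-2, -3}))
# {(a or b), (not a or b)} implies (b) by RUP
print(implies_by_rup({frozenset({1, 2}), frozenset({-1, 2})}, {2}))
```

The sketch rescans the whole clause set after each unit, which is quadratic; practical checkers use watched literals instead, a detail outside the scope of the definitions above.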

For a formula $F$ and clause $C$, if $F$ and $F \land C$ are equisatisfiable (both satisfiable
or both unsatisfiable) then $C$ is redundant with respect to $F$. Efficiently
identifiable redundant clauses are at the foundation of many formula simpli- 
fication techniques and refutation systems (for instance, see [32,33,37,43]). In 
general, deciding whether a clause is redundant is complete for the complement 
of the class DP [6], containing both NP and co-NP [55], so solvers and proof sys- 
tems rely on polynomially-decidable redundancy properties for checking specific 
instances of redundancy. The following characterization of redundant clauses 
provides a common framework for formulating such properties. 


Theorem 1 (Heule, Kiesl, and Biere [36]). A clause $C \ne \bot$ is redundant
with respect to a formula $F$ if and only if there is a partial assignment $\omega$ such
that $C|_\omega = \top$ and $F|_\alpha \models F|_\omega$, for the partial assignment $\alpha = \{\neg l \mid l \in C\}$.


The partial assignment $\omega$, usually called a witness for $C$, includes at least one
of the literals occurring in $C$, while $\alpha$ is said to block the clause $C$. Redundancy
properties can be defined by replacing $\models$ in the theorem above with efficiently-decidable
relations $R$ such that $R \subseteq {\models}$. Propagation redundancy (PR) [33] replaces
$\models$ with $\vdash_1$, where $F \vdash_1 G$ if and only if $F$ implies each $D \in G$ by RUP.
The property PR gives rise to a refutation system, in which a refutation is a
list of clauses $C_1,\dots,C_n$ and witnesses $\omega_1,\dots,\omega_n$ such that $C_k|_{\omega_k} = \top$ and
$(F \land \bigwedge_{i=1}^{k-1} C_i)|_{\alpha_k} \vdash_1 (F \land \bigwedge_{i=1}^{k-1} C_i)|_{\omega_k}$ for all $1 \le k \le n$, and $F \land \bigwedge_{i=1}^{n} C_i \vdash_1 \bot$.

Most redundancy properties used in SAT solving can be understood as restricted
forms of propagation redundancy. The RAT property [43] is equivalent
to literal propagation redundancy, where the witness $\omega$ for any clause $C$ may
differ from the associated $\alpha$ on only one literal; that is, $\omega = (\alpha \setminus \{\neg l\}) \cup \{l\}$ for
some $l \in C$ [36]. The DRAT system [69] is based on RAT, with the added ability
to remove clauses from the accumulated formula $F \land \bigwedge C_i$.


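[Figure 1: diagram with nodes R$_{\mathrm{BDD}}$, PR$_{\mathrm{BDD}}$, RUP$_{\mathrm{BDD}}$, R, PR [33], RAT [43], RUP [27], R$_{\mathrm{CNF+XOR}}$, GE [68], R$_{\mathrm{CNF+Card}}$, and CR [39], connected by arrows.]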


Fig. 1: Different notions of redundancy and their relationships. An arrow from A 
to B indicates A generalizes B. Properties to the right of the thick dashed line 
are polynomially checkable; those to the right of the thin dotted line only derive 
logical consequences. Novel properties defined in this paper are grey. 


3 Redundancy for Boolean Functions 


Theorem 1 provides a foundation for clausal proof systems by characterizing re- 
dundant clauses in a convenient way. However, the restriction to clauses places 
limitations on these systems, making some forms of non-clausal reasoning diffi- 
cult to express. For solvers aiming to construct refutations in these systems, this 
translates directly to restrictions on which solving techniques can be used. 

We show this characterization can be broadened to include redundancy for
non-clausal constraints, and can be used to define useful redundancy properties
and refutation systems. The contributions of this paper are divided into three
corresponding levels of generality. The top level, covered in the current section,
is the direct extension of Theorem 1 from redundancy for clauses, written R,
to redundancy for Boolean functions, written R$_f$. The middle level, the focus of
Section 4, instantiates the resulting Theorem 2 to define the refutation systems
RUP$_{\mathrm{BDD}}$ and PR$_{\mathrm{BDD}}$ based on redundancy for Binary Decision Diagrams. At
the bottom level, these systems are shown to easily handle Gaussian elimination
(GE) in Section 5, as well as some aspects of cardinality reasoning (CR). The
relationships between these notions of redundancy are shown in Figure 1.

Each level of generality is individually important to this work. At the bottom
level, the straightforward expression of Gaussian elimination by RUP$_{\mathrm{BDD}}$ and
PR$_{\mathrm{BDD}}$ makes it more feasible for solvers to use this efficient technique with
proof production, especially as these systems generalize their clausal analogs
already in use. The results in Section 6 confirm the usefulness of RUP$_{\mathrm{BDD}}$ for
this purpose. At the middle level, we show the notion of redundancy instantiated
for BDDs in this way may be capable of other strong forms of reasoning as well.
Finally, the top level provides a very general form of redundancy, independent
of function representation. This may make possible the design of redundancy
properties and refutation systems in contexts where the BDD representation
of constraints is too large; for example, it is known that some pseudo-Boolean
constraints can in general have exponential size BDD representations [1,41].

This section presents in Theorem 2 a characterization of redundancy for 
Boolean functions in general. One way of instantiating this characterization is 
demonstrated in Section 4 where the functions are represented by Binary Deci- 
sion Diagrams; the resulting refutation systems are shown in Section 5 to easily 
express Gaussian elimination. However, the applicability of Theorem 2 is much 
broader, providing a foundation for redundancy-based refutation systems inde- 
pendent of the representation used. 

Proofs of theoretical results not included in the text can be found in an
extended version of this paper [5]. We begin with the property R$_f$.


Definition 1. A Boolean function $g$ is redundant with respect to a Boolean
function $f$ if the functions $f$ and $f \land g$ are both satisfiable, or both unsatisfiable.


As we will see, extending Theorem 1 to the non-clausal case relies on the notion
of a Boolean transformation, or just transformation: a function $\varphi : B^V \to B^V$,
mapping assignments to assignments. Importantly, for a function $f$ and transformation
$\varphi$, in fact $f \circ \varphi : B^V \to B$ is a function as well, where as usual
$f \circ \varphi(\tau) = f(\varphi(\tau))$. For instance let $F = x_1 \land x_2$ and for all $\tau \in B^V$, the
transformation $\varphi$ flips $x_1$, so that $\varphi(\tau)(x_1) = \neg\tau(x_1)$, and ignores $x_2$, that is,
$\varphi(\tau)(x_2) = \tau(x_2)$. Then in fact $F \circ \varphi$ is expressed by the formula $\neg x_1 \land x_2$.
Composing a function with a transformation can be seen as a generalization
of the application of a partial assignment to a formula or clause as defined in
the previous section. Specifically, for a partial assignment $\sigma$ let $\hat\sigma$ refer to the
following transformation: for any assignment $\tau$, the assignment $\hat\sigma(\tau)$ satisfies
$\bigwedge_{l \in \sigma} l$ and $\hat\sigma$ ignores any $x \in V$ such that $x, \neg x \notin \sigma$. Then for any formula $F$
the formula $F|_\sigma$ expresses exactly the function $F \circ \hat\sigma$. In particular, if $\alpha$ is the
partial assignment blocking a clause $C$ then notice $C \circ \hat\alpha(\tau) = 0$ for all $\tau$, but $\hat\alpha$
ignores variables not appearing in $C$; consequently $\hat\alpha(\tau) = \tau$ if $\tau$ already falsifies
$C$. Generalizing this idea to transformations that block non-clausal constraints is
more complicated. In particular, there may be multiple blocking transformations.


Example 1. Let $g$ be the function $g(\tau) = 1$ if and only if $\tau(a) \ne \tau(b)$ (i.e. $g$ is
an XOR constraint). Transformations $\alpha_1, \alpha_2$ are shown in the table below.


$\tau(a)$  $\tau(b)$ | $g$ | $\alpha_1(\tau)(a)$  $\alpha_1(\tau)(b)$ | $g \circ \alpha_1$ | $\alpha_2(\tau)(a)$  $\alpha_2(\tau)(b)$ | $g \circ \alpha_2$
   0        0     |  0  |        0                 0          |       0        |        0                 0          |       0
   0        1     |  1  |        0                 0          |       0        |        1                 1          |       0
   1        0     |  1  |        0                 0          |       0        |        0                 0          |       0
   1        1     |  0  |        1                 1          |       0        |        1                 1          |       0


Both transformations ignore all $x \ne a, b$. Notice if $g(\tau) = 0$ then $\tau$ is unaffected
by either transformation, and $g \circ \alpha_1(\tau) = g \circ \alpha_2(\tau) = 0$ for any assignment $\tau$.
However $\alpha_1$ and $\alpha_2$ are different, so that, for example, if $F = \neg a \land (b \lor c)$ and $\tau$
satisfies the literals $\neg a$, $b$, and $c$ then $F \circ \alpha_1(\tau) = 1$ but $F \circ \alpha_2(\tau) = 0$.


Motivated by this we define transformations blocking a function as follows. 


Definition 2. A transformation $\alpha$ blocks a function $g$ if $g \circ \alpha$ is unsatisfiable,
and for any assignment $\tau$ if $g(\tau) = 0$ then $\alpha(\tau) = \tau$.
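
Example 1 and Definition 2 can be checked by brute force. The Python sketch below is an illustration under stated assumptions rather than part of the paper: assignments are dictionaries over the three variables a, b, c, the two transformations are written out from the table of Example 1, and the helper names `compose`, `blocks`, and `all_assignments` are hypothetical.

```python
from itertools import product

VARS = ("a", "b", "c")

def g(t):
    """XOR constraint of Example 1: 1 iff t(a) != t(b)."""
    return int(t["a"] != t["b"])

def alpha1(t):
    """Map every assignment satisfying g to a = b = 0; fix all others."""
    return {**t, "a": 0, "b": 0} if g(t) else dict(t)

def alpha2(t):
    """Map every assignment satisfying g to the one where a takes the value of b."""
    return {**t, "a": t["b"]} if g(t) else dict(t)

def compose(f, phi):
    """f o phi, i.e. the function t -> f(phi(t))."""
    return lambda t: f(phi(t))

def all_assignments():
    for bits in product((0, 1), repeat=len(VARS)):
        yield dict(zip(VARS, bits))

def blocks(alpha, g):
    """Definition 2: g o alpha is unsatisfiable, and alpha(t) = t whenever g(t) = 0."""
    return all(g(alpha(t)) == 0 and (g(t) == 1 or alpha(t) == t)
               for t in all_assignments())

def F(t):
    """The formula (not a) and (b or c) from Example 1."""
    return int((not t["a"]) and (t["b"] or t["c"]))

tau = {"a": 0, "b": 1, "c": 1}
print(blocks(alpha1, g), blocks(alpha2, g))              # True True
print(compose(F, alpha1)(tau), compose(F, alpha2)(tau))  # 1 0
```

Both transformations block $g$, yet they send the same assignment to different images, which is why the choice of blocking transformation matters once $f$ is composed with it.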


Notice any $g$ not equal to the constant function 1 has blocking transformations;
for example, by mapping every $\tau$ satisfying $g$ to a particular assignment falsifying
it. Using this definition, the following theorem shows how the redundancy of a
Boolean function $g$ with respect to another function $f$ can be demonstrated.
This is a direct generalization of Theorem 1, using a transformation blocking $g$
in the place of the partial assignment blocking a clause, and a transformation $\omega$
such that $g \circ \omega$ is the constant function 1 in place of the witnessing assignment.


Theorem 2. Let $f$ be a function and $g$ a non-constant function. Then $g$ is
redundant with respect to $f$ if and only if there exist transformations $\alpha$ and $\omega$
such that $\alpha$ blocks $g$ and $g \circ \omega$ is the constant function 1, and further $f \circ \alpha \models f \circ \omega$.


Proof. ($\Rightarrow$) Suppose $g$ is redundant with respect to $f$ and let $\alpha$ be any transformation
blocking $g$. If $f$ is unsatisfiable then $f \circ \alpha$ is as well, so that $f \circ \alpha \models f \circ \omega$
holds for any $\omega$. Thus we can take as $\omega$ the transformation $\omega(\tau) = \tau^*$ for all
$\tau \in B^V$, where $\tau^*$ is some assignment satisfying $g$. If instead $f$ is satisfiable, by
redundancy so is $f \land g$. Here we can take as $\omega$ the transformation $\omega(\tau) = \tau^*$ for
all $\tau \in B^V$, where $\tau^*$ is some assignment satisfying $f \land g$. Then both $f \circ \omega$ and
$g \circ \omega$ are the constant function 1, so that $f \circ \alpha \models f \circ \omega$ holds in this case as well.

($\Leftarrow$) Suppose $\alpha, \omega$ meet the criteria stated in the theorem. We show that $g$
is redundant by demonstrating that if $f$ is satisfiable, then so is $f \land g$. Suppose
$\tau$ is an assignment satisfying $f$. If also $g(\tau) = 1$, then of course $\tau$ satisfies
$f \land g$. If instead $g(\tau) = 0$, then $\alpha(\tau) = \tau$ as $\alpha$ blocks the function $g$. Thus
$f \circ \alpha(\tau) = f(\alpha(\tau)) = f(\tau) = 1$. As $f \circ \alpha \models f \circ \omega$, this means $f(\omega(\tau)) = 1$. As
$g \circ \omega$ is the constant function 1 then $g(\omega(\tau)) = 1$, so $\omega(\tau)$ satisfies $f \land g$.
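
For very small variable sets, both directions of Theorem 2 can be confirmed exhaustively. The following Python sketch is an illustration only: it assumes two variables, represents transformations as dictionaries from assignments to assignments, and enumerates all 256 of them; the function names are hypothetical and the approach is exponential, so it is not a checking procedure.

```python
from itertools import product

VARS = ("a", "b")
ASSIGNMENTS = list(product((0, 1), repeat=len(VARS)))  # all of B^V as tuples

def satisfiable(f):
    return any(f(t) for t in ASSIGNMENTS)

def redundant(f, g):
    """Definition 1: f and (f and g) are both satisfiable or both unsatisfiable."""
    return satisfiable(f) == satisfiable(lambda t: f(t) and g(t))

def blocks(alpha, g):
    """Definition 2: g o alpha is unsatisfiable and alpha fixes every falsifier of g."""
    return all(not g(alpha[t]) and (g(t) or alpha[t] == t) for t in ASSIGNMENTS)

def is_witness_pair(f, g, alpha, omega):
    """The three conditions of Theorem 2 for one candidate pair (alpha, omega)."""
    const_one = all(g(omega[t]) for t in ASSIGNMENTS)
    entails = all(f(omega[t]) for t in ASSIGNMENTS if f(alpha[t]))
    return blocks(alpha, g) and const_one and entails

def all_transformations():
    """Every map from B^V to B^V, as a dict from assignment to assignment."""
    for images in product(ASSIGNMENTS, repeat=len(ASSIGNMENTS)):
        yield dict(zip(ASSIGNMENTS, images))

def has_witness_pair(f, g):
    return any(is_witness_pair(f, g, al, om)
               for al in all_transformations() if blocks(al, g)
               for om in all_transformations())

xor = lambda t: t[0] != t[1]
f_or = lambda t: t[0] or t[1]     # xor is redundant with respect to (a or b)
f_and = lambda t: t[0] and t[1]   # xor is not redundant with respect to (a and b)
for f in (f_or, f_and):
    print(redundant(f, xor), has_witness_pair(f, xor))
```

In both cases the two printed values agree, as the theorem predicts: a pair of transformations exists exactly when $g$ is redundant.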


The clausal characterization in Theorem 1 shows that the redundancy of a 
clause can be evidenced by providing a witnessing assignment and demonstrating 
that an implication holds, providing a foundation for refutations based on the 
iterative conjunction of clauses. Theorem 2 above shows that the redundancy of 
a function in general can be seen in the same way by providing transformations
$\alpha$ and $\omega$. Consequently this suggests how to construct refutations based on the
iterative conjunction of Boolean functions. 


Definition 3. A sequence $\sigma = (g_1,\alpha_1,\omega_1),\dots,(g_n,\alpha_n,\omega_n)$ is a redundancy sequence
for a Boolean function $f$ if:


1. $\alpha_k$ blocks $g_k$ and $g_k \circ \omega_k$ is the constant function 1, for all $1 \le k \le n$,
2. $(f \land \bigwedge_{i=1}^{k-1} g_i) \circ \alpha_k \models (f \land \bigwedge_{i=1}^{k-1} g_i) \circ \omega_k$, for all $1 \le k \le n$.

As for clausal redundancy, refutations are intuitively based on the following: if
$g_1$ is redundant with respect to $f$, and $g_2$ is redundant with respect to $f \land g_1$,
then $f$ and $f \land g_1 \land g_2$ are equisatisfiable; that is, $g_1 \land g_2$ is redundant with respect
to $f$. The following holds as a direct consequence.


Proposition 1. Let $f$ be a Boolean function. If $(g_1,\alpha_1,\omega_1),\dots,(g_n,\alpha_n,\omega_n)$ is
a redundancy sequence for $f$, and $f \land \bigwedge_{i=1}^{n} g_i$ is unsatisfiable, then so is $f$.


This shows, abstractly, how redundant Boolean functions can be used as a ba- 
sis for refutations in the same way as redundant clauses. To define practical, and 
polynomially-checkable, refutation systems based on non-clausal redundancy in 
this way, we focus on a representation of Boolean functions that can be used 
within the framework described above. Specifically, we consider sets of BDDs in 
conjunction, just as formulas are sets of clauses in conjunction. Clauses are easily 
expressed by BDDs, and thus this representation easily expresses (CNF) formu- 
las; this is necessary as we are typically interested in proving the unsatisfiability 
not of functions in general, but of (CNF) formulas. It is important to notice this 
is only a particular instantiation of Theorem 2, and that other representations 
of Boolean functions may give rise to useful and efficient systems as well. 

BDDs [3,13,49] are compact expressions of Boolean functions in the form of
rooted, directed, acyclic graphs consisting of decision nodes, each labeled by a
variable $x \in V$ and having two children, and two terminal nodes, labeled by 0
and 1. The BDD for a function $f : B^V \to B$ is based on its Shannon expansion,


$f = (\neg x \land f \circ \hat\sigma_0) \lor (x \land f \circ \hat\sigma_1)$


where $\sigma_0 = \{\neg x\}$ and $\sigma_1 = \{x\}$, for $x \in V$. As is common we assume BDDs are
ordered and reduced: if a node with variable label $x$ precedes a node with label
$y$ in the graph then $x < y$, and the graph has no distinct, isomorphic subgraphs.
Representation this way is canonical up to variable order, so that no two distinct
BDDs with the same variable order represent the same Boolean function [13].
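
To illustrate the Shannon expansion and canonicity, the sketch below builds a reduced ordered BDD from a Python predicate by expanding on one variable at a time. It is a toy construction written for this text, assuming nodes are nested tuples `(variable_index, low_child, high_child)` with terminals 0 and 1; it enumerates the full truth table and so takes exponential time, unlike the apply-based algorithms of real BDD packages.

```python
def build_robdd(f, n):
    """Reduced ordered BDD of f: {0,1}^n -> {0,1}, variables ordered x_0 < ... < x_{n-1}."""
    unique = {}  # unique table: one shared tuple object per distinct (var, low, high)

    def mk(i, lo, hi):
        if lo == hi:                 # both cofactors agree: the test on x_i is redundant
            return lo
        return unique.setdefault((i, lo, hi), (i, lo, hi))

    def rec(prefix):
        i = len(prefix)
        if i == n:                   # all variables fixed: evaluate f
            return int(bool(f(prefix)))
        # Shannon expansion on x_i: low child for x_i = 0, high child for x_i = 1
        return mk(i, rec(prefix + (0,)), rec(prefix + (1,)))

    return rec(())

def size(node, seen=None):
    """Number of distinct decision nodes reachable from node."""
    seen = set() if seen is None else seen
    if node in (0, 1) or node in seen:
        return 0
    seen.add(node)
    return 1 + size(node[1], seen) + size(node[2], seen)

xor_a = build_robdd(lambda t: t[0] ^ t[1], 2)
xor_b = build_robdd(lambda t: (t[0] or t[1]) and not (t[0] and t[1]), 2)
print(xor_a == xor_b)                                  # True: same function, same BDD
print(build_robdd(lambda t: t[0] or t[1], 2))          # (0, (1, 0, 1), 1)
print(size(build_robdd(lambda t: (t[0] + t[1] + t[2]) % 2, 3)))  # 5
```

Two syntactically different descriptions of XOR yield structurally equal diagrams, which is the canonicity property relied on later when UnitProp compares BDDs for equality or negation.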
Our use of BDDs for representing non-clausal redundancy relies on the concept
of cofactors as developed in BDD literature. The functions $f \circ \hat\sigma_0$ and $f \circ \hat\sigma_1$
are called literal cofactors of $f$ by $\neg x$ and $x$, respectively, and are usually written
$f|_{\neg x}$ and $f|_x$. The cofactor of $f$ by a conjunction of literals $c = l_1 \land \dots \land l_n$ can be
defined similarly, so that $f|_c = f \circ \hat\sigma_c$ for the partial assignment $\sigma_c = \{l_1,\dots,l_n\}$.
This notation is the same as for the application of a partial assignment to a clause
or formula from Section 2, as the notions coincide. More precisely, if a formula
$F$ and BDD $f$ express the same function, so do the formula $F|_{\sigma_c}$ and BDD $f|_c$.
More broadly, for BDDs $f$ and $g$, a generalized cofactor of $f$ by $g$ is a BDD
$h$ such that $f \land g = h \land g$; that is, $f$ and $h$ agree on all assignments satisfying
$g$. This leaves unspecified what value $h(\tau)$ should take when $g(\tau) = 0$, and various
different BDD operations have been developed for constructing generalized
cofactors [20,21,22]. The constrain operation [21] produces for $f$ and $g$, with $g$
not equal to the always false 0 BDD, a generalized cofactor which can be seen
as the composition $f \circ \pi_g$, where $\pi_g$ is the transformation [63]:


$\pi_g(\tau) = \tau$ if $g(\tau) = 1$, and $\pi_g(\tau) = \arg\min_{\{\tau' \mid g(\tau') = 1\}} d(\tau, \tau')$ otherwise.


The function $d$ is defined as follows: $d(\tau, \tau') = \sum_{i=1}^{n} |\tau(x_i) - \tau'(x_i)| \cdot 2^{n-i}$, where
$V = \{x_1,\dots,x_n\}$ with $x_1 < \dots < x_n$. Intuitively, $d$ is a measure of distance
between two assignments based on the variables on which they disagree, weighted
by their position in the variable order. It is important to notice then that the
transformation $\pi_g$ and the resulting $f \circ \pi_g$ depend on the variable order, and
may differ for distinct orders. For a conjunction of literals $c$, though, $f \circ \pi_c = f|_c$
regardless of the order, so that $f|_g$ refers to $f \circ \pi_g$ in general.
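
The distance $d$ and the transformation $\pi_g$ can be spelled out semantically. The Python sketch below is a brute-force illustration, not the BDD algorithm for constrain from [21]: functions are assumed to be predicates on tuples of bits ordered $x_1,\dots,x_n$, and the names `distance`, `pi`, and `constrain` are hypothetical.

```python
from itertools import product

def distance(t1, t2):
    """d(t, t') = sum over i of |t(x_i) - t'(x_i)| * 2^(n-i), with x_1 first in the order."""
    n = len(t1)
    return sum(abs(a - b) << (n - i)
               for i, (a, b) in enumerate(zip(t1, t2), start=1))

def pi(g, n):
    """The transformation pi_g on n-bit tuples; g is assumed to be satisfiable."""
    models = [t for t in product((0, 1), repeat=n) if g(t)]

    def transform(t):
        # assignments satisfying g are fixed; all others move to the nearest model
        return t if g(t) else min(models, key=lambda m: distance(t, m))

    return transform

def constrain(f, g, n):
    """Semantic generalized cofactor f|g = f o pi_g."""
    p = pi(g, n)
    return lambda t: f(p(t))

at_most_one = lambda t: sum(t) <= 1
p = pi(at_most_one, 3)
print(p((1, 1, 1)), p((0, 1, 1)), p((1, 0, 0)))
```

Because the weights are distinct powers of two, two different models never lie at the same distance from an assignment, so the minimum is unique; earlier variables in the order are the most expensive to flip and are therefore preserved when possible.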

As the transformation $\pi_g$ maps an assignment falsifying the function $g$ to
the nearest assignment (with respect to $d$) satisfying it, a transformation that
blocks the function $g$ can surely be obtained as follows.


Lemma 1. If $g$ is not equal to the constant function 1 then $\pi_{\neg g}$ blocks $g$.


This form of generalized cofactor, as computed by the constrain operation,
is well suited for use in redundancy-based reasoning as described above, as the
transformation $\pi_g$ depends only on $g$. As a consequence, for BDDs $f_1$ and $f_2$
in fact $(f_1 \land f_2)|_{\neg g} = f_1|_{\neg g} \land f_2|_{\neg g}$; that is, the BDD $(f_1 \land f_2)|_{\neg g}$ expresses the
same function as the BDD for the conjunction $f_1|_{\neg g} \land f_2|_{\neg g}$. Thus given a set of
BDDs $f_1,\dots,f_n$ we can represent $(f_1 \land \dots \land f_n)|_{\neg g}$ simply by the set of cofactors
$f_i|_{\neg g}$ and without constructing the BDD for the conjunction $f_1 \land \dots \land f_n$, which
is NP-hard in general. In particular, given a formula $F = C_1 \land \dots \land C_n$ and a
Boolean constraint $g$, the function $F|_{\neg g}$ can be represented simply by applying
the constrain operation to each of the BDDs representing $C_i$. Therefore, from
Theorem 2 we can characterize redundancy for conjunctions of BDDs, written
R$_{\mathrm{BDD}}$, as follows.


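Proposition 2. Suppose $f_1,\dots,f_n$ are BDDs and $g$ is a non-constant BDD. If
there is a partial assignment $\{l_1,\dots,l_k\}$ such that for $\omega = \bigwedge_{i=1}^{k} l_i$,


$f_1|_{\neg g} \land \dots \land f_n|_{\neg g} \models f_1|_\omega \land \dots \land f_n|_\omega$


and $g|_\omega = 1$ then $g$ is redundant with respect to $f_1 \land \dots \land f_n$.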


4 BDD Redundancy Properties 


The previous section provided a characterization of redundancy for Boolean 
functions, and showed how this could be instantiated for BDDs. In this section we 
develop polynomially-checkable properties for showing that a BDD is redundant 
with respect to a conjunction of BDDs, and describe their use in refutation 
systems for proving the unsatisfiability of formulas.


UnitProp($f_1,\dots,f_n$)

1 repeat
2   if $f_i = 0$ or $f_i = \neg f_j$ for some $1 \le i,j \le n$ then
3     return “conflict”
4   if $U(f_i) \ne \emptyset$ for some $1 \le i \le n$ then
5     $f_j := f_j|_{\bigwedge U(f_i)}$, for all $1 \le j \le n$
6 until no update to $f_1,\dots,f_n$


Fig. 2: A procedure for unit propagation over a set of BDDs 


As Theorem 1 is used for defining clausal redundancy properties, Proposition 2
gives rise to BDD redundancy properties by replacing $\models$ with polynomially-decidable
relations. Similar to the use of the unit propagation procedure by the
clausal properties RUP and PR, we describe a unit propagation procedure for
use with a set of BDDs and derive analogous properties RUP$_{\mathrm{BDD}}$ and PR$_{\mathrm{BDD}}$.

For a BDD $f$, the Shannon expansion shows that if $f|_{\neg l} = 0$ (i.e. $f|_{\neg l}$ is the
always false 0 BDD) for some literal $l$, then $f = l \land f|_l$, and therefore $f \models l$. Then
the units implied by $f$, written $U(f)$, can be defined as follows.


Definition 4. $U(f) = \{l \mid \mathrm{var}(l) \in V \text{ and } f|_{\neg l} = 0\}$, for $f : B^V \to B$.


As $f|_{\neg l}$ can be computed in $O(|f|)$, where $|f|$ is the number of nodes in the BDD
for $f$ [59], then $U(f)$ can certainly be computed in $O(|V| \cdot |f|) \subseteq O(|f|^2)$, though
this can be reduced to $O(|f|)$. We write $\bigwedge U(f)$ to mean $\bigwedge_{l \in U(f)} l$.

Figure 2 provides a sketch of the unit propagation procedure. Whenever $U(f)$
is non-empty for some $f$ in a set of BDDs, each BDD in the set can be replaced
with its cofactor by $\bigwedge U(f)$. This approach to unit propagation is largely similar
to that of Olivo and Emerson [53], except we consider two conflict situations: if
some BDD becomes 0, or if two BDDs are the negations of each other.

For $N = |f_1| + \dots + |f_n|$ the procedure UnitProp($f_1,\dots,f_n$) can be performed
in time $O(N^2)$. In line 5, if $f_j$ and $\bigwedge U(f_i)$ share no variables, then
$f_j = f_j|_{\bigwedge U(f_i)}$; otherwise the BDD for $f_j|_{\bigwedge U(f_i)}$ can be constructed in time
$O(|f_j|)$ and further $|f_j|_{\bigwedge U(f_i)}| < |f_j|$. This procedure is correct: “conflict” is only
returned when $\bigwedge_{i=1}^{n} f_i$ is unsatisfiable (see the extended paper for the proof).


Proposition 3. If UnitProp($f_1,\dots,f_n$) returns “conflict” then $f_1 \land \dots \land f_n = 0$.


UnitProp generalizes the usual unit propagation procedure on a formula: if
$C$ is a clause, then $U(C) \ne \emptyset$ implies $C$ is a unit clause and $\bigwedge_{l \in U(C)} l = C$. We
extend the relation $\vdash_1$ and the definition of RUP accordingly.


Definition 5. Let $f_1,\dots,f_n$ and $g \ne 0$ be BDDs. Then $f_1 \land \dots \land f_n$ implies $g$
by RUP$_{\mathrm{BDD}}$ if UnitProp($f_1|_{\neg g},\dots,f_n|_{\neg g}$) returns “conflict.”


Example 2. Let $F = \{C_1 = b \lor c,\ C_2 = a \lor b,\ C_3 = a \lor c\}$, and assume $a < b < c$.
Consider $g$ as shown in Figure 3, expressing the cardinality constraint $g(\tau) = 1$
if and only if $\tau$ satisfies at least two of $a, b, c$; also written $\{a,b,c\} \ge 2$. Figure 3
shows the updates made throughout UnitProp($C_1|_{\neg g}, C_2|_{\neg g}, C_3|_{\neg g}$). Notice that
$U(C_1|_{\neg g}) = \{\neg a\}$, and $U((C_2|_{\neg g})|_{\neg a}) = \{b\}$. Then $C_3|_{\neg g}$ after cofactoring by
$\neg a$ and $b$ becomes the constant BDD 0, so the procedure returns “conflict.” As
a result, $F$ implies the BDD $g$ by RUP$_{\mathrm{BDD}}$.


[Figure 3: (a) The constraint $g$. (b) UnitProp($(b \lor c)|_{\neg g}, (a \lor b)|_{\neg g}, (a \lor c)|_{\neg g}$).]


Fig. 3: Example derivation of a constraint $g$, shown in (a), using RUP$_{\mathrm{BDD}}$. In (b),
the top line shows the BDDs for each of the clauses $(b \lor c)$, $(a \lor b)$, $(a \lor c)$ after
cofactoring by $\neg g$. The second line shows each of these BDDs after cofactoring
by the unit $\neg a \in U((b \lor c)|_{\neg g})$. Here, the middle BDD becomes simply the unit
$b$, and the third line shows each BDD cofactored by the unit $b$. In this line, the
third BDD has become 0, so a conflict is returned.
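
Example 2 can be replayed semantically. The Python sketch below is a hedged illustration, not the prototype implementation discussed in Section 6: each Boolean function is stood in for by the frozenset of its satisfying assignments (canonical, like a reduced ordered BDD, but exponential in size), constrain is computed by brute force, and all function names are hypothetical.

```python
from itertools import product

N = 3                                       # variables a < b < c at indices 0, 1, 2
ALL = list(product((0, 1), repeat=N))
FULL = frozenset(ALL)

def table(pred):
    """Canonical stand-in for a BDD: the set of satisfying assignments."""
    return frozenset(t for t in ALL if pred(t))

def constrain(f, g):
    """f|g = f o pi_g, where pi_g sends each falsifier of g to its nearest model."""
    def dist(t, m):
        return sum(abs(x - y) << (N - 1 - i) for i, (x, y) in enumerate(zip(t, m)))
    def pi(t):
        return t if t in g else min(g, key=lambda m: dist(t, m))
    return frozenset(t for t in ALL if pi(t) in f)

def cofactor(f, lits):
    """Cofactor of f by a set of literals, each given as (index, value)."""
    def fix(t):
        t = list(t)
        for i, v in lits:
            t[i] = v
        return tuple(t)
    return frozenset(t for t in ALL if fix(t) in f)

def units(f):
    """U(f): the literals l such that the cofactor of f by not-l is 0."""
    return {(i, v) for i in range(N) for v in (0, 1)
            if not cofactor(f, {(i, 1 - v)})}

def unit_prop(fs):
    """Figure 2 on truth tables; returns True exactly when "conflict" is reached."""
    fs = list(fs)
    while True:
        if any(not f for f in fs) or any(f == FULL - h for f in fs for h in fs):
            return True
        for f in fs:
            u = units(f)
            new = [cofactor(h, u) for h in fs]
            if u and new != fs:             # line 5: cofactor every function by U(f)
                fs = new
                break
        else:
            return False                    # no update possible

clauses = [table(lambda t: t[1] or t[2]),   # C1 = b or c
           table(lambda t: t[0] or t[1]),   # C2 = a or b
           table(lambda t: t[0] or t[2])]   # C3 = a or c
not_g = table(lambda t: sum(t) < 2)         # negation of {a,b,c} >= 2
print(units(constrain(clauses[0], not_g)))  # {(0, 0)}, i.e. the unit "not a"
print(unit_prop([constrain(c, not_g) for c in clauses]))  # True
```

The run follows the example: the first cofactored clause yields the unit $\neg a$, the second then reduces to the unit $b$, and the third becomes 0.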


We show that RUP$_{\mathrm{BDD}}$ is a redundancy property. Given BDDs $f_1,\dots,f_n, g$,
checking whether $g$ is implied by RUP$_{\mathrm{BDD}}$ primarily consists of the UnitProp
procedure, though each $f_i|_{\neg g}$ must first be constructed, which can be done in
time $O(|f_i| \cdot |g|)$ [21]. The size of this BDD may in some cases be larger than the
size of $f_i$, though it is typically smaller [21,63] and at worst $|f_i|_{\neg g}| \le |f_i| \cdot |g|$.
Consequently it can be decided in time $O(|g|^2 \cdot N^2)$ whether $g$ is implied by
RUP$_{\mathrm{BDD}}$. Finally if $g$ is implied by RUP$_{\mathrm{BDD}}$ then it is redundant with respect to
$f_1 \land \dots \land f_n$; in fact, it is a logical consequence (proof of the following is available
in the extended paper).


Proposition 4. If $f_1 \land \dots \land f_n \vdash_1 g$, then $f_1 \land \dots \land f_n \models g$.

From RUP$_{\mathrm{BDD}}$ the property PR can be directly generalized to this setting as
well. Specifically, we define the redundancy property PR$_{\mathrm{BDD}}$ as follows.


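Definition 6. Suppose $f_1,\dots,f_n$ are BDDs and $g$ is a non-constant BDD. Then
$g$ is PR$_{\mathrm{BDD}}$ with respect to $\bigwedge_{i=1}^{n} f_i$ if there is a partial assignment $\{l_1,\dots,l_k\}$ such
that $g|_\omega = 1$ and $\bigwedge_{i=1}^{n} f_i|_{\neg g} \vdash_1 f_j|_\omega$ for all $1 \le j \le n$, where $\omega = \bigwedge_{i=1}^{k} l_i$.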


Proposition 2 shows if $g$ is PR$_{\mathrm{BDD}}$ with respect to $f = f_1 \land \dots \land f_n$ then $g$ is
redundant with respect to $f$, thus PR$_{\mathrm{BDD}}$ is a redundancy property.

Notice these properties and derivations directly generalize their clausal equivalents;
for example, if $C$ is PR with respect to a formula $F$, then (the BDD
expressing) $C$ is PR$_{\mathrm{BDD}}$ with respect to (the set of BDDs expressing) $F$. Deciding
whether a clause $C$ is PR with respect to a formula $F$ is NP-complete [37].
As PR$_{\mathrm{BDD}}$ generalizes PR, then PR$_{\mathrm{BDD}}$ is NP-hard as well. Further, checking
whether $g$ is PR$_{\mathrm{BDD}}$ with respect to $f_1 \land \dots \land f_n$ by some candidate $\omega$ can be
done polynomially as argued above, thus the following holds.


Proposition 5. Deciding whether $g$ is PR$_{\mathrm{BDD}}$ with respect to $f_1 \land \dots \land f_n$, given
the BDDs $g, f_1,\dots,f_n$, is NP-complete.


In other words, the decision problems for PR and PR$_{\mathrm{BDD}}$ are of equal complexity.

The properties RUP$_{\mathrm{BDD}}$ and PR$_{\mathrm{BDD}}$ as defined in this section can be used
to show that a BDD can be added to a set of BDDs in a satisfiability-preserving
way. Of course, any clause has a straightforward and simple representation as a
BDD, so that a formula can be easily represented this way as a set of BDDs. As
a result RUP$_{\mathrm{BDD}}$ and PR$_{\mathrm{BDD}}$ can be used as systems for refuting unsatisfiable
formulas. In the following, we identify a clause with its representation as a BDD,
and a formula with its representation as a set of such BDDs.

To simplify the presentation of derivations based on RUP$_{\mathrm{BDD}}$ and PR$_{\mathrm{BDD}}$
we introduce an additional redundancy property, allowing derivations to include
steps to directly derive certain BDDs path-wise in the following way.


Definition 7. $f_1 \land \dots \land f_n$ implies $g$ by RUP$_{\mathrm{path}}$ if (1) $f_1 \land \dots \land f_n \vdash_1 \neg c$ for
every $c = l_1 \land \dots \land l_m$ such that $l_1,\dots,l_m$ is a path from the root of $g$ to the 0
terminal, and (2) $|g| \le \log_2(|f_1| + \dots + |f_n|)$.


If $f_1 \land \dots \land f_n$ implies $g$ by RUP$_{\mathrm{path}}$ then it is a logical consequence of $f_1 \land \dots \land f_n$,
as this checks that no assignment satisfies both $\neg g$ and $f_1 \land \dots \land f_n$. The number
of paths in a BDD $g$ can however be exponential in $|g|$, as in the BDD for an XOR
constraint, so the second condition ensures RUP$_{\mathrm{path}}$ is polynomially-checkable.

The property RUP$_{\mathrm{path}}$ is primarily useful as it allows the derivation of a
BDD $g$ whose representation as a set of clauses is included in $\{f_1,\dots,f_n\}$: if $c$
corresponds to a path to 0 in $g$, the clause $\neg c$ is included in the direct clausal
translation of $g$. In this context, the restrictive condition (2) in Definition 7 can
in fact be removed, since the number of paths in $g$ is then at most $n$.


Definition 8. A sequence of BDDs $g_1,\dots,g_n$ is a RUP$_{\mathrm{BDD}}$ derivation from
a formula $F$ if $F \land \bigwedge_{i=1}^{k-1} g_i$ implies $g_k$ by RUP$_{\mathrm{BDD}}$, or by RUP$_{\mathrm{path}}$, for all
$1 \le k \le n$. A sequence of BDD and assignment pairs $(g_1,\omega_1),\dots,(g_n,\omega_n)$ is
a PR$_{\mathrm{BDD}}$ derivation from a formula $F$ if $F \land \bigwedge_{i=1}^{k-1} g_i$ implies $g_k$ by RUP$_{\mathrm{path}}$, or
$\omega_k$ is a PR$_{\mathrm{BDD}}$-witness for $g_k$ with respect to $F \land \bigwedge_{i=1}^{k-1} g_i$, for all $1 \le k \le n$.

As RUP$_{\mathrm{BDD}}$, RUP$_{\mathrm{path}}$, and PR$_{\mathrm{BDD}}$ are redundancy properties, any RUP$_{\mathrm{BDD}}$ or
PR$_{\mathrm{BDD}}$ derivation corresponds to a redundancy sequence of the same length.


Example 3. Consider the formula $F = \{a \lor b, a \lor c, b \lor c, a \lor d, b \lor d, c \lor d\}$ and let
$g$ be the BDD such that $g(\tau) = 1$ if and only if $\tau$ satisfies at least 3 of $a, b, c, d$;
that is, $g$ is the cardinality constraint $\{a,b,c,d\} \ge 3$. As seen in Example 2, the
constraint $g_1 = \{a,b,c\} \ge 2$ is RUP$_{\mathrm{BDD}}$ with respect to $F$; similarly so are the
constraints $g_2 = \{a,c,d\} \ge 2$ and $g_3 = \{b,c,d\} \ge 2$. Now, $\neg a \in U(g_3|_{\neg g})$: for
any $\tau$ the assignment $\pi_{\neg g}(\tau)$ satisfies at most 2 of $a, b, c, d$, and if $a$ is one of
them then $\pi_{\neg g}(\tau)$ surely falsifies $g_3$. As a result, $(g_3|_{\neg g})|_a = 0$. In a similar way
$\neg b \in U(g_2|_{\neg g})$. Since $g_1|_{\neg g}$ cofactored by the units $\neg a$ and $\neg b$ is falsified, then
UnitProp($g_1|_{\neg g}, g_2|_{\neg g}, g_3|_{\neg g}$) returns “conflict.” Consequently $g$ is RUP$_{\mathrm{BDD}}$ with
respect to $F \land g_1 \land g_2 \land g_3$, and $g_1, g_2, g_3, g$ is a RUP$_{\mathrm{BDD}}$ derivation from $F$.


This example can be generalized to show that RUP$_{\mathrm{BDD}}$ is capable of expressing
an inference rule for cardinality constraints called the diagonal sum [40]. For
$L = \{l_1,\dots,l_n\}$ let $L_i = L \setminus \{l_i\}$; the diagonal sum derives $L \ge k+1$ from the
set of all $n$ constraints $L_i \ge k$.

While the properties and refutation systems RUP_BDD and PR_BDD easily extend
their clausal counterparts, it is important to notice that redundancy-based
systems using BDDs can be defined in other ways. For instance, say
$\bigwedge_{i=1}^{n} f_i$ implies $g$ by IMP_pair if
$f_i|_{\neg g} \wedge f_j|_{\neg g} = 0$ for some $i, j$. Then IMP_pair is
polynomially checkable, computing the conjunction for each pair $i, j$.
Moreover, it is clear that $f_1 \wedge f_2 \models g$ if and only if
$f_1 \wedge f_2$ implies $g$ by IMP_pair. As many logical inference rules have
this form, it is possible that systems based on IMP_pair are very strong.


5 Gaussian Elimination 


Next, we show how the Gaussian elimination technique for simplifying XOR
constraints embedded in a formula is captured by the redundancy properties
defined in the previous section. Specifically, if an XOR constraint $X$ is
derivable from a formula $F$ by Gaussian elimination, we show there is a
RUP_BDD derivation from $F$ including the BDD expressing $X$ with only a linear
size increase.

An XOR clause $[x_1,\dots,x_n]^p$ expresses the function
$f : B^V \to B$, where $V = \{x_1,\dots,x_n\}$ and $p$ is 0 or 1, such that
$f(\tau) = 1$ if and only if the number of $x_i \in V$ satisfied by $\tau$ is
equal modulo 2 to $p$. In other words, $p$ expresses the parity of the positive
literals $x_i$ an assignment must satisfy in order to satisfy the XOR clause.
As $[x,y,y]^p$ and $[x]^p$ express the same function, we assume no variable
occurs more than once in an XOR clause. Notice that $[\,]^0$ expresses the
constant function 1, while $[\,]^1$ expresses 0.

The Gaussian elimination procedure begins by detecting XOR clauses encoded in
a formula $F$. The direct encoding $D(X)$ of $X = [x_1,\dots,x_n]^p$ is the
collection of clauses of the form $C = \{l_1,\dots,l_n\}$, where each $l_i$ is
either $x_i$ or $\neg x_i$ and the number of negated literals in each $C$ is
not equal modulo 2 to $p$. The formula $D(X)$ expresses the same function as
$X$, containing the clauses preventing each assignment over the variables in
$X$ not satisfying $X$. As a result, $D(X)$ implies the BDD expressing $X$ by
RUP_path (see the extended paper for proof).


Lemma 2. $D(X)$ implies $X$ by RUP_path, for $X = [x_1,\dots,x_n]^p$.
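To make the definitions of XOR clauses and their direct encoding concrete, the following sketch builds $D(X)$ and evaluates both $X$ and $D(X)$ under an assignment. The representation (a literal as a pair of variable and polarity, an assignment as a dictionary) and all function names are assumptions made for illustration; they are not taken from dxddcheck.

```python
from itertools import product


def xor_value(variables, p, assignment):
    """Value of the XOR clause [variables]^p under `assignment`."""
    return sum(assignment[v] for v in variables) % 2 == p


def direct_encoding(variables, p):
    """Clauses of D(X): one clause per sign pattern whose number of negated
    literals differs from p modulo 2. A literal is (variable, polarity)."""
    clauses = []
    for signs in product([True, False], repeat=len(variables)):
        negated = sum(1 for s in signs if not s)
        if negated % 2 != p:
            clauses.append(frozenset(zip(variables, signs)))
    return clauses


def cnf_value(clauses, assignment):
    """A clause is satisfied if some literal agrees with the assignment."""
    return all(any(assignment[v] == pol for v, pol in c) for c in clauses)
```

For an XOR clause over $n$ variables this yields $2^{n-1}$ clauses, which is why detecting XOR clauses and reasoning on them directly is attractive.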


Similar to the approach of Philipp and Rebola-Pardo [56], we represent
Gaussian elimination steps by deriving the addition $X \oplus Y$ of XOR clauses
$X = [x_1,\dots,x_m,z_1,\dots,z_r]^p$ and $Y = [y_1,\dots,y_n,z_1,\dots,z_r]^q$,
given by:

$$X \oplus Y = [x_1,\dots,x_m,y_1,\dots,y_n]^{p \oplus q}.$$


The following lemma shows that $X \oplus Y$ is RUP_BDD with respect to
$X \wedge Y$; that is, if a RUP_BDD derivation includes $X$ and $Y$ then
$X \oplus Y$ can be derived as well. This is a result of the following
observation: while the precise cofactors of $X$ and $Y$ by $\neg(X \oplus Y)$
depend on the variable order $<$, they are the negations of one another (proof
is included in the extended paper).


Lemma 3. Let $v$ be the $<$-greatest variable occurring in exactly one of $X$
and $Y$, and assume $v$ occurs in $Y$. Then $X|_{\neg(X \oplus Y)} = X$, and
$Y|_{\neg(X \oplus Y)} = \neg X$.


The above lemma shows that the procedure
UnitProp$(X|_{\neg(X \oplus Y)}, Y|_{\neg(X \oplus Y)})$ returns "conflict"
immediately, and as a result $X \oplus Y$ is RUP_BDD with respect to
$f_1 \wedge \dots \wedge f_n \wedge X \wedge Y$ for any set of BDDs
$f_1,\dots,f_n$.
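The addition of XOR clauses is a symmetric difference on the variable sets together with addition of the parities modulo 2. A minimal sketch, assuming an XOR clause is represented as a pair of a frozenset of variables and a parity bit (a representation chosen here for illustration), also confirms by enumeration that $X \wedge Y$ entails $X \oplus Y$. It does not compute BDD cofactors, so it illustrates the inference rather than Lemma 3 itself.

```python
from itertools import product


def xor_add(x, y):
    """X (+) Y for XOR clauses given as (frozenset_of_variables, parity)."""
    (xv, p), (yv, q) = x, y
    # Shared variables cancel; parities are added modulo 2.
    return (xv ^ yv, p ^ q)


def satisfies(clause, assignment):
    variables, p = clause
    return sum(assignment[v] for v in variables) % 2 == p


def entails_sum(x, y):
    """Check by enumeration that every model of X and Y satisfies X (+) Y."""
    z = xor_add(x, y)
    variables = sorted(x[0] | y[0])
    for bits in product([0, 1], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if satisfies(x, a) and satisfies(y, a) and not satisfies(z, a):
            return False
    return True
```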

Define a Gaussian elimination derivation $\Pi$ from a formula $F$ as a sequence
of XOR clauses $\Pi = X_1,\dots,X_N$, such that for all $1 \le i \le N$, either
$X_i = X_j \oplus X_k$ for $j, k < i$, or $D(X_i) \subseteq F$. The size of the
derivation is $|\Pi| = \sum_{i=1}^{N} s_i$, where $s_i$ is the number of
variables occurring in $X_i$. We show that $\Pi$ corresponds to a RUP_BDD
derivation with only a linear size increase. This size increase is a result of
the fact that the BDD expressing an XOR clause $X = [x_1,\dots,x_n]^p$ has size
$2n + 1$ (proof of the following theorem is in the extended paper).


Theorem 3. Suppose $\Pi = X_1,\dots,X_N$ is a Gaussian elimination derivation
from a formula $F$. Then there is a RUP_BDD derivation from $F$ with size
$O(|\Pi|)$.
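As an illustration of such a derivation and its size $|\Pi|$, the sketch below checks a small Gaussian elimination derivation of the unsatisfiable XOR clause $[\,]^1$. It is a simplification under stated assumptions: the set `inputs` stands in for the test $D(X_i) \subseteq F$, indices are 0-based, and the names and step format are hypothetical rather than those of dxddcheck.

```python
def xor_add(x, y):
    """X (+) Y for XOR clauses given as (frozenset_of_variables, parity)."""
    (xv, p), (yv, q) = x, y
    return (xv ^ yv, p ^ q)


def check_derivation(inputs, steps):
    """Validate a derivation and return (derived clauses, size |Pi|).

    Each step is ("input", clause), where the clause must be in `inputs`
    (standing in for D(X_i) being a subset of F), or ("add", j, k), which
    adds two earlier clauses referenced by 0-based position.
    """
    derived = []
    for step in steps:
        if step[0] == "input":
            clause = step[1]
            if clause not in inputs:
                raise ValueError("clause is not encoded in the formula")
        elif step[0] == "add":
            _, j, k = step
            if not (0 <= j < len(derived) and 0 <= k < len(derived)):
                raise ValueError("addition refers to a later clause")
            clause = xor_add(derived[j], derived[k])
        else:
            raise ValueError("unknown step kind")
        derived.append(clause)
    size = sum(len(variables) for variables, _ in derived)
    return derived, size


# a+b = 1, b+c = 1, a+c = 1 (mod 2) is unsatisfiable.
X1 = (frozenset("ab"), 1)
X2 = (frozenset("bc"), 1)
X3 = (frozenset("ac"), 1)
STEPS = [("input", X1), ("input", X2), ("input", X3),
         ("add", 0, 1), ("add", 3, 2)]
DERIVED, SIZE = check_derivation({X1, X2, X3}, STEPS)
```

Here the fourth clause is $[a,c]^0$ and the last one is $[\,]^1$, with the derivation size being the sum of the clause lengths.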


A consequence of this theorem is that RUP_BDD includes short refutations
for formulas whose unsatisfiability can be shown by Gaussian elimination. More
precisely, suppose a formula $F$ includes the direct representations of an
unsatisfiable collection of XOR clauses. Then there is a polynomial-length
Gaussian elimination derivation of the unsatisfiable XOR clause $[\,]^1$ from
$F$ [62], and by Theorem 3, a polynomial-length RUP_BDD derivation of the
unsatisfiable BDD 0.

Notably, RUP_BDD then includes short refutations of, for example, the Tseitin
formulas, for which no polynomial-length refutations exist in the resolution
system [64,66]. This limitation of resolution holds as well for the clausal RUP
system, without the ability to introduce new variables, as it can be
polynomially simulated by resolution [9,25]. As the translation into RUP_BDD
used to prove Theorem 3 introduces no new variables, this demonstrates the
strength of RUP_BDD compared to resolution and its clausal analog RUP.

Fig. 4: Usage of the tool dxddcheck, showing an example formula and refutation.


6 Results 


To begin to assess the practical usefulness of the systems introduced in
Section 4, we have implemented in Python a prototype of a tool called
dxddcheck¹ for checking refutations in a subset of RUP_BDD. In particular we
focus on the result of Section 5, that Gaussian elimination is succinctly
captured by RUP_BDD.

We ran the SAT solver Lingeling (version bcp) on a collection of crafted
unsatisfiable formulas, all of which can be solved using Gaussian elimination.
From Lingeling output we extract a list of XOR clause additions and deletions,
ending with the addition of the empty clause, as shown in Figure 4. This list is
passed directly to dxddcheck, which carries it out as a DRUP_BDD refutation;
that is, a RUP_BDD refutation also allowing steps which remove or "delete" BDDs
from the set. These deletion steps can be removed without affecting the
correctness of the refutation, though their inclusion can decrease the time
required for checking it, as is the case with DRUP and RUP.


Formula        | number of | number of | solving  || proof | proof     | checking
               | variables | clauses   | time (s) || lines | size (KB) | time (s)
rpar_50        |       148 |       394 |      0.1 ||   297 |         7 |     0.34
rpar_100       |       298 |       794 |      0.1 ||   597 |        15 |     1.35
rpar_200       |       598 |      1594 |      0.2 ||  1197 |        35 |     6.67
mchess_19      |       680 |      2291 |      0.0 ||  1077 |        41 |     4.07
mchess_21      |       836 |      2827 |      0.1 ||  1317 |        50 |     5.09
mchess_23      |      1008 |      3419 |      0.1 ||  1581 |        63 |     6.42
urquhart-s5-b2 |       107 |       742 |      0.0 ||   150 |         7 |     0.95
urquhart-s5-b3 |       121 |      1116 |      0.1 ||   150 |         9 |     1.64
urquhart-s5-b4 |       114 |       888 |      0.0 ||   150 |         8 |     1.20


For these experiments we used a 1.8 GHz Intel Core i5 CPU with 8 GB of
memory. The table shows the time Lingeling took to solve each formula, the
number of lines in the constructed proof and its size, and the time dxddcheck
took to construct and check the associated DRUP_BDD proof. These benchmarks
are well-known challenging examples in the contexts of XOR reasoning and proof
production. The rpar_n formulas are compact, permuted encodings of two
contradictory parity constraints on n variables, described by Chew and
Heule [18]. The mchess_n formulas are encodings of the mutilated n × n
chessboard problem, as studied by Heule, Kiesl, and Biere [34] as well as
Bryant and Heule [14]. The urquhart formulas [17,65] are examples of hard
Tseitin formulas.

¹ Source code is available under the MIT license at http://fmv.jku.at/dxddcheck
along with the benchmarks used and our experimental data.

Lingeling solved each formula by Gaussian elimination almost instantly. We
also ran Lingeling and Kissat [11], winner of the main track of the SAT
competition in 2020, on the benchmarks without Gaussian elimination, as is
required for producing clausal refutations, using an Intel Xeon E5-2620 v4 CPU
at 2.10 GHz. Only rpar_50 was solved within about 10 hours, and the resulting
proofs were significantly larger; for instance, Kissat produced a refutation of
size 6911 MB.

While methods to construct clausal proofs from Gaussian elimination have 
been proposed, most are either lacking a public implementation or are limited in 
scope [18,56]. An exception is the approach very recently proposed by Gocht and 
Nordström using pseudo-Boolean reasoning [26], with which we are interested in 
carrying out a thorough comparison of results in the future. 


7 Conclusion 


We presented a characterization of redundancy for Boolean functions,
generalizing the framework of clausal redundancy and efficient clausal proof
systems. We showed this can be instantiated to design redundancy properties for
functions given by BDDs, and polynomially-checkable refutation systems based on
the conjunction of redundant BDDs, including the system PR_BDD generalizing
the clausal system PR. The system PR_BDD also generalizes RUP_BDD, which can
express Gaussian elimination reasoning without extension variables or clausal
translations. The results of a preliminary implementation of a subset of
RUP_BDD confirm that such refutations are compact and can be efficiently
checked.

Examples 2 and 3 show RUP_BDD reasoning over cardinality constraints, and
we are interested in exploring rules such as generalized resolution [39,40].
Other forms of non-clausal reasoning may be possible using BDD-based redundancy
systems as well. We are particularly interested in exploring the property
IMP_pair.

While the system RUP_BDD derives only constraints implied by the conjunction
of the formula and previously derived constraints, PR_BDD is capable of
interference-based reasoning [30], like its clausal analog PR; there are
possibly novel, non-clausal reasoning techniques taking advantage of this
ability. Further, RUP_BDD and PR_BDD are based on the conjunction of BDDs,
though Theorem 2 is more general and could be used for other ways of expressing
Boolean functions. Finally we are interested in developing an optimized tool
for checking proofs in the system PR_BDD, as well as a certified proof checker.


Abstract. Interpretation methods constitute a foundation of termination
analysis for term rewriting. From time to time remarkable instances of
interpretation methods appeared, such as polynomial interpretations, matrix
interpretations, arctic interpretations, and their variants. In this paper we
introduce a general framework, the multi-dimensional interpretation method,
that subsumes these variants as well as many previously unknown interpretation
methods as instances. Employing the notion of derivers, we prove the soundness
of the proposed method in an elegant way. We implement the proposed method in
the termination prover NaTT and verify its significance through experiments.


1 Introduction 


Term rewriting [2] is a formalism for reasoning about function definitions or
functional programs. For instance, a term rewrite system (TRS)
$\mathcal{R}_{\mathsf{fact}}$ [7] consisting of the following rewrite rules
defines the factorial function:

$$\mathsf{fact}(0) \to \mathsf{s}(0) \qquad
\mathsf{fact}(\mathsf{s}(x)) \to \mathsf{mul}(\mathsf{s}(x), \mathsf{fact}(\mathsf{p}(\mathsf{s}(x)))) \qquad
\mathsf{p}(\mathsf{s}(x)) \to x$$

assuming that s, p, and mul are interpreted respectively as the successor,
predecessor, and multiplication functions.

Analyzing whether a TRS terminates, meaning that the corresponding
functional program responds or the function is well defined, has been an
active research area for decades. Consequently, several fully automatic
termination provers have been developed, e.g., AProVE [10], TTT2 [20],
CiME [5], MU-TERM [23], and NaTT [34], and have been competing in the annual
Termination Competitions (TermCOMP) [1].

Throughout their history, interpretation methods [25] have been foundational
in termination analysis. They are categorized by the choice of well-founded
carriers and the class of functions as which symbols are interpreted.
Polynomial interpretations [22] use the natural numbers $\mathbb{N}$ as the
carrier and interpretations are monotone polynomials, i.e., every variable has
coefficient at least 1. Weakly monotone polynomials, i.e., zero coefficients,
are allowed in the dependency pair method [1]. Negative constants are allowed
using the max operator [15]. General combinations of polynomials and the max
operator are proposed in both the standard [37] and the dependency pair
settings [9]. Negative coefficients and thus non-monotone polynomials are also
allowed, but in a more elaborate theoretical framework [15,9].

These methods share the common carrier $\mathbb{N}$. In contrast, matrix
interpretations choose vectors over $\mathbb{N}$ as the carrier, and interpret
symbols as affine maps over it. Although the carrier is generalized, matrix
interpretations do not properly generalize polynomial interpretations, since
not all polynomials are affine. This gap can be filled by improved matrix
interpretations, which further generalize the carrier to square matrices [6],
so that natural polynomial interpretations can be subsumed by matrix
polynomials over $1 \times 1$ matrices. In arctic interpretations [19], the
carrier consists of vectors over arctic naturals
($\mathbb{N} \cup \{-\infty\}$) or integers ($\mathbb{Z} \cup \{-\infty\}$),
and interpretations are affine maps over it, where affinity is with respect to
the max/plus semiring.

Having this many variations would be welcome if you are a user of a
termination tool in which someone else has already implemented all of them. It
would not be so if you are the developer of a termination tool in which you
will have to implement all of them. Also, to ultimately trust termination
tools, one needs to formalize proof methods using proof assistants and obtain a
trusted certifier that validates the outputs of termination tools; see, e.g.,
the IsaFoR/CeTA [31] or CoLoR/Rainbow [4] frameworks. Although some
interpretation methods have already been formalized [28,30], adding missing
variants one by one would cost significant effort.

In this paper, we introduce a general framework for interpretation methods,
which subsumes most of the above-mentioned methods as instances, namely,
(max-)polynomial interpretations (with negative constants), (improved) matrix
interpretations, and arctic interpretations, as well as a syntactic method
called argument filtering [1,21]. Moreover, we obtain a number of previously
unexplored interpretation methods as other instances.

After preliminaries, we start with a convenient fact about reduction pairs, a
central tool in termination proving with dependency pairs (Section 3).

The first step to the main contribution is the use of derivers [24,33], which
allow us to abstract away the mathematical details of polynomials or
max-polynomials. We will obtain a key soundness result that derivers derive
monotone interpretations from monotone interpretations (Section 4).

The second step is to extend derivers to multi-dimensional ones. This setting
further generalizes (improved) matrix interpretations, so that max-polynomials,
negative constants, and negative entries are allowed (Section 5). It will also
be hinted that multi-dimensional derivers can emulate the effect of negative
coefficients, although theoretical comparison is left for future work. We also
show that our approach subsumes arctic interpretations by adding a treatment
for $-\infty$ (Section 6). Although the original formulation by Koprowski and
Waldmann involves some subtleties, we will show that our simpler formulation is
sufficient.

As strict monotonicity is crucial for proving termination without dependency
pairs, and is still useful with dependency pairs, we will see how to ensure
strict monotonicity (Section 7). At this point, the convenient fact we have
seen in Section 3 becomes crucial.

Finally, the proposed method is implemented in the termination prover NaTT,
and experimental results are reported (Section 8). We evaluate various
instances of our method, some corresponding to known interpretation methods and
many others not. We choose two new instances to integrate into the NaTT
strategy. The new strategy proved the termination of 20 more benchmarks than
the old one, and five of them were not proved by any tool in TermCOMP 2020.


2 Preliminaries 


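We start with order-sorted algebras. Let $\mathcal{S} = (S, \sqsubseteq)$ be a
partially ordered set, where elements in $S$ are called sorts and
$\sqsubseteq$ is called the subsort relation. An $\mathcal{S}$-sorted set is an
$S$-indexed family $A = \{A^{\sigma}\}_{\sigma \in S}$ such that
$\sigma \sqsubseteq \tau$ implies $A^{\sigma} \subseteq A^{\tau}$. We write
$A^{(\sigma_1,\dots,\sigma_n)}$ for the set
$A^{\sigma_1} \times \dots \times A^{\sigma_n}$. A sorted map between
$\mathcal{S}$-sorted sets $X$ and $A$ is a mapping $f$, written
$f : X \to A$, such that $x \in X^{\sigma}$ implies $f(x) \in A^{\sigma}$.

An $\mathcal{S}$-sorted signature is an $S^* \times S$-indexed family
$F = \{F_{\vec\sigma,\tau}\}_{(\vec\sigma,\tau) \in S^* \times S}$ of function
symbols.¹ When $f \in F_{(\sigma_1,\dots,\sigma_n),\tau}$ we say $f$ has rank
$(\sigma_1,\dots,\sigma_n) \to \tau$ and arity $n$ in $F$. We may also view
sorted sets and signatures as sets: having $a : \sigma \in A$ means
$a \in A^{\sigma}$, and $f : \vec\sigma \to \tau \in F$ means
$f \in F_{\vec\sigma,\tau}$.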


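Example 1. Consider sort $\mathsf{Nat}$. We define the following
$\{\mathsf{Nat}\}$-sorted signatures:

- $\mathcal{N} := \{0 : () \to \mathsf{Nat},\ 1 : () \to \mathsf{Nat},\ 2 : () \to \mathsf{Nat},\ \dots\}$
- $\mathcal{N}_{*} := \mathcal{N} \cup \{* : (\mathsf{Nat},\mathsf{Nat}) \to \mathsf{Nat}\}$
- $\mathcal{N}_{+} := \mathcal{N} \cup \{+ : (\mathsf{Nat},\mathsf{Nat}) \to \mathsf{Nat}\}$
- $\mathcal{N}_{\max} := \mathcal{N} \cup \{\max : (\mathsf{Nat},\mathsf{Nat}) \to \mathsf{Nat}\}$

Let us abbreviate unions of signatures by concatenations of subscripts: for
instance $\mathcal{N}_{*+\max}$ denotes
$\mathcal{N}_{*} \cup \mathcal{N}_{+} \cup \mathcal{N}_{\max}$. Next consider
sorts $\mathsf{Neg}$ and $\mathsf{Int}$ with
$\mathsf{Nat}, \mathsf{Neg} \sqsubseteq \mathsf{Int}$. We define the following
$\{\mathsf{Nat},\mathsf{Neg},\mathsf{Int}\}$-sorted signatures:

- $\mathcal{Z} := \mathcal{N} \cup \{0 : () \to \mathsf{Neg},\ -1 : () \to \mathsf{Neg},\ -2 : () \to \mathsf{Neg},\ \dots\}$
- $\mathcal{Z}_{*} := \mathcal{Z} \cup \mathcal{N}_{*} \cup \{* : (\mathsf{Neg},\mathsf{Neg}) \to \mathsf{Nat},\ * : (\mathsf{Int},\mathsf{Int}) \to \mathsf{Int}\}$
- $\mathcal{Z}_{+} := \mathcal{Z} \cup \mathcal{N}_{+} \cup \{+ : (\mathsf{Neg},\mathsf{Neg}) \to \mathsf{Neg},\ + : (\mathsf{Int},\mathsf{Int}) \to \mathsf{Int}\}$
- $\mathcal{Z}_{\max} := \mathcal{Z} \cup \mathcal{N}_{\max} \cup \{\max : (\mathsf{Nat},\mathsf{Int}) \to \mathsf{Nat},\ \max : (\mathsf{Int},\mathsf{Nat}) \to \mathsf{Nat},\ \max : (\mathsf{Int},\mathsf{Int}) \to \mathsf{Int}\}$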


For an $\mathcal{S}$-sorted signature $F$, an $F$-algebra $(A, [\cdot])$
consists of an $\mathcal{S}$-sorted set $A$ called the carrier and a family
$[\cdot]$ of mappings called the interpretation such that
$[f] : A^{\vec\sigma} \to A^{\tau}$ whenever $f \in F_{\vec\sigma,\tau}$.


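Example 2. We consider the following standard interpretation $[\cdot]$:

$$\dots \quad [-2] := -2 \quad [-1] := -1 \quad [0] := 0 \quad [1] := 1 \quad [2] := 2 \quad \dots$$
$$[*](a,b) := a \cdot b \qquad [+](a,b) := a + b \qquad [\max](a,b) := \max(a,b)$$

Notice that $(\mathbb{N}, [\cdot])$ is an $\mathcal{N}_{*+\max}$-algebra and
$(\mathbb{Z}, [\cdot])$ is a $\mathcal{Z}_{*+\max}$-algebra. Here, the
$\{\mathsf{Nat}\}$-sorted set $\mathbb{N}$ is defined by
$\mathbb{N}^{\mathsf{Nat}} := \mathbb{N}$ and the
$\{\mathsf{Nat},\mathsf{Neg},\mathsf{Int}\}$-sorted set $\mathbb{Z}$ is defined
by $\mathbb{Z}^{\mathsf{Nat}} := \mathbb{N}$,
$\mathbb{Z}^{\mathsf{Neg}} := \{0,-1,-2,\dots\}$ and
$\mathbb{Z}^{\mathsf{Int}} := \mathbb{Z}$.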


¹ In the literature, sorted signatures are given more assumptions such as
monotonicity or regularity. For the purpose of this paper, these assumptions
are not necessary.

Sorted Terms: Given an $\mathcal{S}$-sorted signature $F$ and an
$\mathcal{S}$-sorted set $V$ of variables, the $\mathcal{S}$-sorted set
$T(F,V)$ of terms is inductively defined as follows:

- $v \in T(F,V)^{\sigma}$ if $v \in V^{\sigma}$;
- $f(s_1,\dots,s_n) \in T(F,V)^{\rho}$ if $f \in F_{\vec\sigma,\tau}$,
  $(s_1,\dots,s_n) \in T(F,V)^{\vec\sigma}$, and $\tau \sqsubseteq \rho$.

An interpretation $[\cdot]$ is extended over terms as follows: given
$\alpha : V \to A$, $[x]\alpha := \alpha(x)$ if $x \in V$, and
$[f(s_1,\dots,s_n)]\alpha := [f]([s_1]\alpha,\dots,[s_n]\alpha)$. The
$F$-algebra $(T(F,V), \cdot)$ (which interprets $f$ as the mapping that takes
$(s_1,\dots,s_n)$ and returns $f(s_1,\dots,s_n)$) is called the term algebra,
and a sorted map $\theta : V \to T(F,V)$ is called a substitution. The term
obtained by replacing every variable $x$ by $\theta(x)$ in $s$ is thus
$s\theta$.


Term Rewriting: This paper is concerned with termination analysis for plain
term rewriting. In this setting, there is only one sort $1$, and we may
identify a $\{1\}$-sorted set $A$ and the set $A^{1}$. The set of variables
appearing in a term $s$ is denoted by $\mathrm{Var}(s)$. A context $C$ is a
term with a special variable $\Box$ occurring exactly once. We denote by $C[s]$
the term obtained by substituting $\Box$ by $s$ in $C$. A rewrite rule is a
pair of terms $l$ and $r$, written $l \to r$, such that $l \notin V$ and
$\mathrm{Var}(l) \supseteq \mathrm{Var}(r)$. A term rewrite system (TRS) is a
set $\mathcal{R}$ of rewrite rules, which induces the root rewrite step
$\xrightarrow{\epsilon}_{\mathcal{R}}$ and the rewrite step
$\to_{\mathcal{R}}$ as the least relations such that
$l\theta \xrightarrow{\epsilon}_{\mathcal{R}} r\theta$ and
$C[l\theta] \to_{\mathcal{R}} C[r\theta]$, for any rule
$l \to r \in \mathcal{R}$, substitution $\theta$, and context $C$. A TRS
$\mathcal{R}$ is terminating iff no infinite rewriting
$s_1 \to_{\mathcal{R}} s_2 \to_{\mathcal{R}} s_3 \to_{\mathcal{R}} \cdots$ is
possible.


The dependency pair (DP) framework [1,14,13] is a de facto standard among
automated termination provers for term rewriting. Here we briefly recapitulate
its essence. The root symbol of a term $s = f(s_1,\dots,s_n)$ is $f$ and is
denoted by $\mathrm{root}(s)$. The set of defined symbols in $\mathcal{R}$ is
$D_{\mathcal{R}} := \{\mathrm{root}(l) \mid l \to r \in \mathcal{R}\}$. We
assume a fresh marked symbol $f^{\sharp}$ for every $f \in D_{\mathcal{R}}$,
and write $s^{\sharp}$ to denote the term $f^{\sharp}(s_1,\dots,s_n)$ for
$s = f(s_1,\dots,s_n)$. A dependency pair of a TRS $\mathcal{R}$ is a rule
$l^{\sharp} \to r^{\sharp}$ such that $\mathrm{root}(r) \in D_{\mathcal{R}}$
and $l \to C[r] \in \mathcal{R}$ for some context $C$. The set of all
dependency pairs of $\mathcal{R}$ is denoted by $\mathrm{DP}(\mathcal{R})$. A
DP problem $\langle \mathcal{P}, \mathcal{R} \rangle$ is just a pair of TRSs.
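The construction of dependency pairs can be sketched directly from this definition. The code below assumes a hypothetical term representation (variables as strings, applications as tuples headed by the symbol name, marked symbols written with a trailing `#`) and computes the dependency pairs of the factorial TRS from the introduction.

```python
def subterms(t):
    """Yield t and all of its subterms."""
    yield t
    if isinstance(t, tuple):
        for arg in t[1:]:
            yield from subterms(arg)


def mark(t):
    """Replace the root symbol f of t by the marked symbol f#."""
    return (t[0] + "#",) + t[1:]


def dependency_pairs(rules):
    """All l# -> u# such that l -> C[u] is a rule and root(u) is defined."""
    defined = {l[0] for l, _ in rules}
    return {
        (mark(l), mark(u))
        for l, r in rules
        for u in subterms(r)
        if isinstance(u, tuple) and u[0] in defined
    }


ZERO = ("0",)
R_FACT = [
    (("fact", ZERO), ("s", ZERO)),
    (("fact", ("s", "x")),
     ("mul", ("s", "x"), ("fact", ("p", ("s", "x"))))),
    (("p", ("s", "x")), "x"),
]
DP_FACT = dependency_pairs(R_FACT)
```

Only `fact` and `p` are defined symbols here, so the pairs all stem from the second rule.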


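Theorem 1 ([1]). A TRS $\mathcal{R}$ is terminating iff the DP problem
$\langle \mathrm{DP}(\mathcal{R}), \mathcal{R} \rangle$ is finite, i.e., there
is no infinite chain

$$s_0 \xrightarrow{\epsilon}_{\mathrm{DP}(\mathcal{R})} t_0 \to_{\mathcal{R}}^{*} s_1 \xrightarrow{\epsilon}_{\mathrm{DP}(\mathcal{R})} t_1 \to_{\mathcal{R}}^{*} \cdots$$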


A number of techniques called DP processors that simplify or decompose DP 
problems are proposed; see [13] for a list of such processors. Among them, the 
central technique for concluding the finiteness of DP problems is the reduction 
pair processor, which will be reformulated in the next section. 


3 Notes on Reduction Pairs 


A reduction pair is a pair $(\succsim, \succ)$ of order-like relations over
terms with some conditions. Here we introduce two formulations of reduction
pairs, one demanding natural assumptions of orderings, and the other, reduction
pair seed, demanding only essential requirements. The first formulation is
useful when proving properties of reduction pairs, while the latter is useful
when devising new reduction pairs. We will show that the two notions are
essentially equivalent: one can always extend a reduction pair seed into a
reduction pair of the former sense. Existing formulations of reduction pairs
lie strictly in between the two.


Definition 1 (reduction pair). A (quasi-)order pair $(\succsim, \succ)$ is a
pair of a quasi-order $\succsim$ and an irreflexive relation
$\succ \subseteq \succsim$, satisfying compatibility:
$\succsim ; \succ ; \succsim \subseteq \succ$. The order pair is well-founded
if $\succ$ is well-founded.

A reduction pair is a well-founded order pair $(\succsim, \succ)$ on terms,
such that both $\succsim$ and $\succ$ are closed under substitutions, and
$\succsim$ is closed under contexts. Here, a relation $\sqsupset$ is closed
under substitutions (resp. contexts) iff $s \sqsupset t$ implies
$s\theta \sqsupset t\theta$ for every substitution $\theta$ (resp.
$C[s] \sqsupset C[t]$ for every context $C$).


The above formulation of reduction pairs is strictly subsumed by standard
definitions (e.g., [1,14,13]), where $\succ$ is not necessarily a subset of
$\succsim$, and compatibility is weakened to either
$\succsim ; \succ \subseteq \succ$ or $\succ ; \succsim \subseteq \succ$.
Instead, $\succ$ is required to be transitive, but this follows from our
assumptions $\succ \subseteq \succsim$ and compatibility:
$\succ ; \succ \subseteq \succsim ; \succ \subseteq \succ$. On one hand, this
means that we can safely import existing results of reduction pairs into our
formulation.


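Theorem 2 (reduction pair processor). Let
$\langle \mathcal{P}, \mathcal{R} \rangle$ be a DP problem and
$(\succsim, \succ)$ be a reduction pair such that
$\mathcal{P} \cup \mathcal{R} \subseteq \succsim$. Then the DP problem
$\langle \mathcal{P}, \mathcal{R} \rangle$ is finite if and only if
$\langle \mathcal{P} \setminus \succ, \mathcal{R} \rangle$ is.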


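Example 3. Consider again the TRS $\mathcal{R}_{\mathsf{fact}}$ of the
introduction. Proving that $\mathcal{R}_{\mathsf{fact}}$ terminates in the DP
framework boils down to finding a reduction pair $(\succsim, \succ)$ satisfying
(considering usable rules [1]):

$$\mathsf{p}(\mathsf{s}(x)) \succsim x \qquad
\mathsf{fact}^{\sharp}(\mathsf{s}(x)) \succ \mathsf{fact}^{\sharp}(\mathsf{p}(\mathsf{s}(x)))$$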


On the other hand, one may wonder whether Definition 1 might be too
restrictive. We justify our formulation by uniformly extending general
"reduction pairs" into reduction pairs that comply with Definition 1. This is
possible for even more general pairs of relations than standard reduction
pairs.


Definition 2 (reduction pair seed). A well-founded order seed is a pair
$(W, S)$ of relations such that $S$ is well-founded and
$S ; W \subseteq S^{+}$. A reduction pair seed is a well-founded order seed on
terms such that both $W$ and $S$ are closed under substitutions, and $W$ is
closed under contexts.


Now we show that every reduction pair seed $(W, S)$ can be extended to a
reduction pair $(\succsim, \succ)$ such that $W \subseteq \succsim$ and
$S \subseteq \succ$. Before that, the assumption $S ; W \subseteq S^{+}$ of
Definition 2 is generalized as follows.


Lemma 1. If $(W, S)$ is a well-founded order seed, then
$S ; W^{*} \subseteq S^{+}$.


Proof. By induction on the number of $W$ steps.


Theorem 3. Let $(W, S)$ be a well-founded order seed. Then
$(\succsim, \succ)$ is a well-founded order pair, where
$\succsim\ := (W \cup S)^{*}$ and $\succ\ := (W^{*} ; S)^{+}$.


Proof. It is trivial that $\succsim$ is a quasi-order and
$\succ \subseteq \succsim$ by definition. We show the well-foundedness of
$\succ$ as follows: Suppose on the contrary we have an infinite sequence:

$$a_1 \mathrel{W^{*}} b_1 \mathrel{S} a_2 \mathrel{W^{*}} b_2 \mathrel{S} a_3 \mathrel{W^{*}} b_3 \mathrel{S} \cdots$$

Then using Lemma 1 ($S ; W^{*} \subseteq S^{+}$) we obtain
$a_1 \mathrel{W^{*}} b_1 \mathrel{S^{+}} b_2 \mathrel{S^{+}} b_3 \mathrel{S^{+}} \cdots$,
which contradicts the well-foundedness of $S$.

Now we show compatibility. By definition we have
$\succsim ; \succ \subseteq \succ$, so it suffices to show
$\succ ; \succsim \subseteq \succ$. By induction we reduce the claim to
$\succ ; (W \cup S) \subseteq \succ$, that is, both
$\succ ; W \subseteq \succ$ and $\succ ; S \subseteq \succ$. Using
$S ; W \subseteq S^{+} = S ; S^{*}$ we have

$$\succ ; W = (W^{*} ; S)^{+} ; W = (W^{*} ; S)^{*} ; W^{*} ; S ; W
\subseteq (W^{*} ; S)^{*} ; W^{*} ; S ; S^{*} \subseteq \succ$$

The other case $\succ ; S \subseteq \succ$ is easy from the definition.
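On a finite carrier the extension of a well-founded order seed can be computed and the claimed properties checked directly. The following is a small sketch under stated assumptions: relations are finite sets of pairs, well-foundedness reduces to the irreflexivity of $S^{+}$, and all function names and the sample relations are hypothetical.

```python
def compose(r, t):
    """Relational composition r ; t."""
    return {(a, c) for a, b in r for b2, c in t if b == b2}


def plus(r):
    """Transitive closure R^+ of a finite relation."""
    closure = set(r)
    while True:
        bigger = closure | compose(closure, r)
        if bigger == closure:
            return closure
        closure = bigger


def star(r, carrier):
    """Reflexive transitive closure R^* over a finite carrier."""
    return plus(r) | {(a, a) for a in carrier}


def is_well_founded_seed(w, s):
    """On a finite set: S^+ is irreflexive and S ; W is contained in S^+."""
    s_plus = plus(s)
    return all(a != b for a, b in s_plus) and compose(s, w) <= s_plus


def extend_seed(w, s, carrier):
    """Return the pair ((W u S)^*, (W^* ; S)^+) of the extension theorem."""
    weak = star(w | s, carrier)
    strict = plus(compose(star(w, carrier), s))
    return weak, strict


CARRIER = {"a", 0, 1, 2}
S = {(2, 1), (1, 0)}
W = {("a", 2), (0, 0)}
WEAK, STRICT = extend_seed(W, S, CARRIER)
```

In this sample, `("a", 1)` becomes a strict pair because a weak step from `"a"` to `2` is followed by a strict step, while `("a", 2)` remains only weak.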


Now we obtain the following corollary of Theorem 2 and Theorem 3.


Corollary 1. Let $\langle \mathcal{P}, \mathcal{R} \rangle$ be a DP problem
and $(W, S)$ a reduction pair seed such that
$\mathcal{P} \cup \mathcal{R} \subseteq W$. Then
$\langle \mathcal{P}, \mathcal{R} \rangle$ is finite if and only if
$\langle \mathcal{P} \setminus S, \mathcal{R} \rangle$ is.


Notice that Definition 2 does not demand any order-like property, most
notably transitivity. This is beneficial when developing new reduction pairs;
for instance, higher-order recursive path orders are known to be
non-transitive, but form a reduction pair seed with their reflexive closure.
Throughout the paper we use Definition 1 since it provides more useful and
natural properties of orderings, which becomes crucial in Section 7.


4 Interpretation Methods as Derivers 


Interpretation methods construct reduction pairs from F-algebras, where F is 
the {1}-sorted signature of an input TRS or DP problem, and the carrier is a 
mathematical structure where a well-founded ordering > is known. In the DP 
framework, weakly monotone F-algebras play an important role. 


Definition 3 (weakly monotone algebra). A mapping
$f : A_1 \times \dots \times A_n \to A$ is monotone with respect to
$\sqsupset$ if
$f(a_1,\dots,a_i,\dots,a_n) \sqsupset f(a_1,\dots,a_i',\dots,a_n)$ whenever
$a_1 \in A_1,\dots,a_n \in A_n$, $a_i' \in A_i$, and $a_i \sqsupset a_i'$. A
weakly monotone $F$-algebra $(A, [\cdot], \ge, >)$ consists of an $F$-algebra
$(A, [\cdot])$ and an order pair $(\ge, >)$ such that every $[f]$ is monotone
with respect to $\ge$.


Example 4. Continuing Example 2, $(\mathbb{N}, [\cdot], \ge, >)$ is a weakly
monotone $\mathcal{N}_{*+\max}$-algebra with the standard ordering
$(\ge, >)$. Notice that $(\mathbb{Z}, [\cdot], \ge, >)$ is not a weakly
monotone $\mathcal{Z}_{*+\max}$-algebra, since multiplication on integers is
not necessarily monotone. Nevertheless, it is a weakly monotone
$\mathcal{Z}_{+\max} \cup \mathcal{N}_{*}$-algebra.

To ease presentation, from now on we assume that $F$ is a $\{1\}$-sorted
signature, while $G$ is an $\mathcal{S}$-sorted signature. It is nevertheless
easy to generalize our results to an arbitrary order-sorted signature $F$.


Theorem 4 ([14]). Let $(A, [\cdot], \ge, >)$ be a weakly monotone $F$-algebra
such that $>$ is well-founded in $A$. Then $([\ge], [>])$ is a reduction pair
on $T(F,V)$, where
$s \mathrel{[\sqsupset]} t :\iff \forall \alpha : V \to A.\ [s]\alpha \sqsupset [t]\alpha$.


Moreover, using the term algebra any reduction pair $(\succsim, \succ)$ on
$T(F,V)$ can be seen as a well-founded $F$-algebra
$(T(F,V), \cdot, \succsim, \succ)$.


Example 5. Continuing Example 4, $([\ge], [>])$ forms a reduction pair for
signature $\mathcal{N}_{*+\max}$. Notice that it does not for
$\mathcal{Z}_{+\max} \cup \mathcal{N}_{*}$, essentially because $>$ is not
well-founded in $\mathbb{Z}$.


In order to prove the finiteness of a given DP problem, we need a weakly
monotone $F$-algebra for the signature $F$ indicated by this problem, rather
than for a predefined signature like $\mathcal{N}_{*+\max}$. We fill the gap by
employing the notion of derivers [24,33] to derive an $F$-algebra from one of
another signature $G$.


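Definition 4 (deriver). An $F/G$-deriver is a pair of a sort $\delta \in S$
and a mapping $d$, such that
$d(f) \in T(G, \{x_1 : \delta, \dots, x_n : \delta\})^{\delta}$ when $f$ has
arity $n$ in $F$. Given a base $G$-algebra $(A, [\cdot])$, we define the
derived $F$-algebra $(A^{\delta}, d[\cdot])$ by

$$d[f](a_1,\dots,a_n) := [d(f)](x_1 \mapsto a_1, \dots, x_n \mapsto a_n)$$

Example 6. Define a
$\{\mathsf{fact}^{\sharp}, \mathsf{p}, \mathsf{s} : 1 \to 1\}/\mathcal{Z}_{+\max}$-deriver
$(\mathsf{Nat}, d)$ by

$$d(\mathsf{fact}^{\sharp}) := x_1 \qquad d(\mathsf{s}) := x_1 + 1 \qquad d(\mathsf{p}) := \max(x_1 - 1, 0)$$

Note that $d(\mathsf{p})$ has sort $\mathsf{Nat}$, thanks to the rank
$(\mathsf{Int}, \mathsf{Nat}) \to \mathsf{Nat}$ of $\max$ in
$\mathcal{Z}_{\max}$. The order pair $(d[\ge], d[>])$ satisfies the constraints
given in Example 3.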
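The derived interpretation of Example 6 can be evaluated mechanically. The sketch below assumes the reading $d(\mathsf{fact}^{\sharp}) = x_1$, $d(\mathsf{s}) = x_1 + 1$, $d(\mathsf{p}) = \max(x_1 - 1, 0)$ over the naturals, uses a hypothetical tuple representation of terms, and tests the two constraints of Example 3 on sample values only; it is a sanity check, not a proof that the constraints hold for all naturals.

```python
# Derived interpretation d[.]: each unary symbol becomes a function on naturals.
DERIVER = {
    "fact#": lambda x: x,
    "s": lambda x: x + 1,
    "p": lambda x: max(x - 1, 0),
}


def evaluate(term, alpha):
    """Evaluate a term under the variable assignment alpha.

    Variables are strings; applications are tuples (symbol, arg1, ...).
    """
    if isinstance(term, str):
        return alpha[term]
    symbol, *args = term
    return DERIVER[symbol](*(evaluate(a, alpha) for a in args))


def constraints_hold(x):
    """p(s(x)) >= x and fact#(s(x)) > fact#(p(s(x))) at the value x."""
    alpha = {"x": x}
    weak = evaluate(("p", ("s", "x")), alpha) >= evaluate("x", alpha)
    strict = (evaluate(("fact#", ("s", "x")), alpha)
              > evaluate(("fact#", ("p", ("s", "x"))), alpha))
    return weak and strict
```

Applying `max(., 0)` in `d(p)` is what keeps the value a natural number even though a negative constant is involved.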


Now we show that an $F/G$-deriver yields a weakly monotone $F$-algebra if the
base $G$-algebra is known to be weakly monotone. Thus, Example 6 proves that
$\mathcal{R}_{\mathsf{fact}}$ is terminating. The next result about
monotonicity is folklore:


Lemma 2. A mapping $f : A^{n} \to A$ is monotone with respect to a quasi-order
$\ge$ if and only if $a_1 \ge b_1,\dots,a_n \ge b_n$ implies
$f(a_1,\dots,a_n) \ge f(b_1,\dots,b_n)$.


Proof. The "if" direction is due to the reflexivity of $\ge$, and the "only
if" direction is easy by induction on $n$ and the transitivity of $\ge$.


Then monotonicity is carried over to the interpretation of terms, in the
following sense. For two sorted maps $\alpha : X \to A$ and
$\beta : X \to A$, we write $\alpha \ge \beta$ to mean that
$\alpha(x) \ge \beta(x)$ for any $x \in X^{\sigma}$ and sort $\sigma$.


Lemma 3. Let $(A, [\cdot], \ge, >)$ be a weakly monotone $G$-algebra and
$s \in T(G,V)^{\sigma}$. If $\alpha \ge \beta$ then
$[s]\alpha \ge [s]\beta$.

Proof. By structural induction on $s$. The claim is trivial if $s$ is a
variable. Consider $s = f(s_1,\dots,s_n)$. We have
$[s_i]\alpha \ge [s_i]\beta$ for each $i \in \{1,\dots,n\}$ by induction
hypothesis. With Lemma 2 and the monotonicity of $[f]$, we conclude:

$$[s]\alpha = [f]([s_1]\alpha,\dots,[s_n]\alpha) \ge [f]([s_1]\beta,\dots,[s_n]\beta) = [s]\beta$$


Lemma 4. Let $(\delta, d)$ be an $F/G$-deriver and $(A, [\cdot], \ge, >)$ a
weakly monotone $G$-algebra. Then $(A^{\delta}, d[\cdot], \ge, >)$ is a weakly
monotone $F$-algebra.


Proof. Suppose that $f$ has arity $n$ in $F$, and for every
$i \in \{1,\dots,n\}$ that $a_i, b_i \in A^{\delta}$ and $a_i \ge b_i$. Then
from Lemma 3:

$$d[f](a_1,\dots,a_n) = [d(f)](x_1 \mapsto a_1,\dots,x_n \mapsto a_n)
\ge [d(f)](x_1 \mapsto b_1,\dots,x_n \mapsto b_n) = d[f](b_1,\dots,b_n)$$

With Lemma 2 we conclude that every $d[f]$ is monotone with respect to $\ge$,
and hence $(A^{\delta}, d[\cdot], \ge, >)$ is a weakly monotone $F$-algebra.


Thus we conclude the soundness of the deriver-based interpretation method: 


Theorem 5. If $(\delta, d)$ is an $F/G$-deriver, $(A, [\cdot], \ge, >)$ is a
weakly monotone $G$-algebra and $>$ is well-founded in $A^{\delta}$, then
$(d[\ge], d[>])$ is a reduction pair.


Proof. Immediate consequence of Lemma 4 and Theorem 4.


It should be clear that Theorem 5 with
$G = \mathcal{Z}_{+\max} \cup \mathcal{N}_{*}$ subsumes the polynomial
interpretation method with negative constants [15, Lemma 4]. Their trick is to
turn integers into naturals by applying $\max(\cdot, 0)$, as demonstrated in
Example 6 in a syntactic manner. Theorem 5 gives a slightly more general fact
that one can mix max and negative constants and still get a reduction pair. As
far as the author knows, this fact has not been reported elsewhere, although
natural max-polynomials without negative constants are known to yield reduction
pairs [9, Section 4.1].

In addition, a syntactic technique known as argument filtering is also a
special case of Theorem 5. In the context of higher-order rewriting, Kop and
van Raamsdonk generalized argument filters into argument functions
[18, Definition 7.7], which, in the first-order case, correspond to derivers
with $G$ being a variant of $F$. In these applications, base signatures and
algebras are not known a priori, but are to be synthesized and analyzed.


5 Multi-Dimensional Interpretations 


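The matrix interpretation method uses a well-founded weakly monotone algebra
$(\mathbb{N}^{m}, [\cdot]_{\mathrm{Mat}}, \ge, \gg)$ over natural vectors, with
an affine interpretation:

$$[f]_{\mathrm{Mat}}(\vec{a}_1,\dots,\vec{a}_n) = C_1\vec{a}_1 + \dots + C_n\vec{a}_n + \vec{c}$$

where $C_1,\dots,C_n \in \mathbb{N}^{m \times m}$ and
$\vec{c} \in \mathbb{N}^{m}$, and the following ordering:

Definition 5 ([8,19]). Given an order pair $(\ge, >)$ on $A$ and a dimension
$m \in \mathbb{N}$, we define the order pair $(\ge, \gg)$ on $A^{m}$ as
follows:

$$(a_1,\dots,a_m) \ge (b_1,\dots,b_m) :\iff a_1 \ge b_1 \wedge a_2 \ge b_2 \wedge \dots \wedge a_m \ge b_m$$
$$(a_1,\dots,a_m) \gg (b_1,\dots,b_m) :\iff a_1 > b_1 \wedge a_2 \ge b_2 \wedge \dots \wedge a_m \ge b_m$$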
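The order pair on vectors compares the first component strictly and all remaining components weakly. A minimal sketch, assuming vectors are Python tuples of integers and using hypothetical function names:

```python
def weak_geq(a, b):
    """Componentwise weak order on vectors of equal length."""
    return len(a) == len(b) and all(x >= y for x, y in zip(a, b))


def strict_gt(a, b):
    """Strict order: strictly greater in the first component,
    weakly greater in every component."""
    return weak_geq(a, b) and a[0] > b[0]
```

Only the first component needs to decrease for the strict order, which is why well-foundedness is required only for that component.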


Improved matrix interpretations [6] consider square matrices instead of vectors, 
and thus, in principle, matrix polynomials can be considered. Now we generalize 
these methods by extending derivers to multi-dimensional ones. 


Definition 6 (multi-dimensional derivers). An m-dimensional F/G-deriver
consists of an m-tuple $\vec{\sigma} \in S^m$ of sorts and a mapping $d$ such that
$d(f) \in T(G, X)^{\vec{\sigma}}$, where $X := \{x_{i,j} : (\vec{\sigma})_j \mid i \in \{1, \dots, n\}, j \in \{1, \dots, m\}\}$
if $f$ has arity $n$ in F. Given a G-algebra $(A, [\cdot])$, the derived F-algebra
$(A^{\vec{\sigma}}, d[\cdot])$ is defined by


$$d[f](\vec{a}_1, \dots, \vec{a}_n) = \bigl([(d(f))_1]\alpha, \dots, [(d(f))_m]\alpha\bigr)$$


where $\alpha$ is defined by $\alpha(x_{i,j}) := (\vec{a}_i)_j$.


Example 7 ([8, Example 1]). The TRS of the single rule $f(f(x)) \to f(g(f(x)))$
can be shown terminating by the following 2-dimensional matrix interpretation:


tma (5 4) 7+ h) maa= (49) a+ (9) 


The 2-dimensional {f, g}/N.-deriver ((Nat, Nat), d) defined by 


ie) = (2) io- (*) 


represents $[\cdot]_{Mat}$ as $d[\cdot]$, that is, $[\geqq]_{Mat} = d[\geqq]$ and $[\gg]_{Mat} = d[\gg]$.
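
The following sketch checks, on sample points only, that a 2-dimensional affine interpretation orients the rule f(f(x)) → f(g(f(x))) strictly. The concrete coefficients are an assumption made for this illustration, and the check is a sanity test, not a termination proof.

```python
# Hypothetical coefficients (an assumption of this sketch).
F_MAT, F_VEC = ((1, 1), (0, 0)), (0, 1)
G_MAT, G_VEC = ((1, 0), (0, 0)), (0, 0)


def affine(mat, vec, x):
    """Return mat * x + vec for a natural matrix and natural vectors."""
    return tuple(
        sum(m * xi for m, xi in zip(row, x)) + c for row, c in zip(mat, vec)
    )


def f(x):
    return affine(F_MAT, F_VEC, x)


def g(x):
    return affine(G_MAT, G_VEC, x)


def strictly_greater(a, b):
    """Strict in the first coordinate, weak in the others."""
    return a[0] > b[0] and all(p >= q for p, q in zip(a[1:], b[1:]))


def rule_decreases(x):
    """Compare the interpretations of f(f(x)) and f(g(f(x)))."""
    return strictly_greater(f(f(x)), f(g(f(x))))


ALL_SAMPLES_OK = all(
    rule_decreases((x1, x2)) for x1 in range(6) for x2 in range(6)
)
```

With these coefficients, f(f(x)) evaluates to (x1 + x2 + 1, 1) and f(g(f(x))) to (x1 + x2, 1), so the first coordinate decreases strictly while the second stays equal.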


Now we prove a counterpart of Theorem 5 for multi-dimensional derivers.
The following lemma is one of the main results of this paper, and it is somewhat
surprisingly easy to prove.


Lemma 5. For an m-dimensional F/G-deriver $(\vec{\sigma}, d)$ and a weakly monotone
G-algebra $(A, [\cdot], \geq, >)$, $(A^{\vec{\sigma}}, d[\cdot], \geqq, \gg)$ is a weakly monotone F-algebra.


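Proof. Let $f$ have arity $n$ in F and $\vec{a}_1, \dots, \vec{a}_n, \vec{b}_1, \dots, \vec{b}_n \in A^{\vec{\sigma}}$ satisfy $\vec{a}_i \geqq \vec{b}_i$.
Define $\alpha$ and $\beta$ by $\alpha(x_{i,j}) := (\vec{a}_i)_j$ and $\beta(x_{i,j}) := (\vec{b}_i)_j$. By assumption we have
$\alpha \geq \beta$, and with Lemma 3 we have


$$(d[f](\vec{a}_1, \dots, \vec{a}_n))_j = [(d(f))_j]\alpha \geq [(d(f))_j]\beta = (d[f](\vec{b}_1, \dots, \vec{b}_n))_j$$


for every $j \in \{1, \dots, m\}$. Hence $d[f](\vec{a}_1, \dots, \vec{a}_n) \geqq d[f](\vec{b}_1, \dots, \vec{b}_n)$, and this
concludes the proof due to Lemma 2.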


Theorem 6. For a multi-dimensional F/G-deriver $(\vec{\sigma}, d)$ and a weakly monotone
G-algebra $(A, [\cdot], \geq, >)$ such that $>$ is well-founded in $A^{(\vec{\sigma})_1}$,
$(d[\geqq], d[\gg])$ is a reduction pair.


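Proof. Thanks to Lemma 5 and Theorem 4 it suffices to show that $\gg$ is
well-founded in $A^{\vec{\sigma}}$. Suppose on the contrary that there exists an infinite sequence
$\vec{a}_1 \gg \vec{a}_2 \gg \cdots$ with $\vec{a}_1, \vec{a}_2, \dots \in A^{\vec{\sigma}}$. Then we have $(\vec{a}_1)_1 > (\vec{a}_2)_1 > \cdots$ and
$(\vec{a}_1)_1, (\vec{a}_2)_1, \dots \in A^{(\vec{\sigma})_1}$, contradicting the well-foundedness of $>$ in $A^{(\vec{\sigma})_1}$.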


It should be clear that every m-dimensional (improved) matrix interpretation
can be expressed as an m-dimensional (or $m^2$-dimensional) F/N4-deriver. There
are two more important consequences of Theorem 6. First, we can interpret
symbols as non-affine maps, even including max-polynomials. Second, since $>$
is not required to be well-founded in $A^{(\vec{\sigma})_2}, \dots, A^{(\vec{\sigma})_m}$, examples that previously
required non-monotone interpretations (and hence a stronger condition than
Theorem 2]) can be handled.


Example 8 (Excerpt of AProVE_08/log). Consider the TRS R consisting of


$$x - 0 \to x \qquad 0 / y \to 0$$
$$s(x) - s(y) \to x - y \qquad s(x) / s(y) \to (s(x) - s(y)) / s(y)$$


which defines (for simplicity, rounded up) natural division. Proving R
terminating using dependency pairs boils down to finding a reduction pair $(\succsim, \succ)$ such
that (again considering usable rules)


$$x - 0 \succsim x \qquad s(x) - s(y) \succsim x - y \qquad s(x) /^{\sharp} s(y) \succ (s(x) - s(y)) /^{\sharp} s(y)$$


A polynomial interpretation $[\cdot]_{Pol}$ with negative coefficients such that


$$[0]_{Pol} = 0 \qquad [s]_{Pol}(x) = x + 1 \qquad [/^{\sharp}]_{Pol}(x, y) = x \qquad [-]_{Pol}(x, y) = \max(x - y, 0)$$


satisfies the above constraints, but one must validate the requirements of
Theorem 11]. In our setting, an F/Z+max-deriver ((Nat, Neg), d) such that


Jo) = a i- (£ + ') rem aur y t) = a 


yields a reduction pair satisfying the above constraints. 


The intuition here is that the two-dimensional interpretation of $s^n(0)$ records
$n$ in the first coordinate and $-n$ in the second. Hence, one does not have to
reconstruct $-n$ from $n$ using the non-monotonic minus operation.

It seems plausible to the author that negative coefficients can be eliminated
using the above idea; however, increasing the dimension leads to more
freedom in variables (the variable introduced to represent $-n$ may take other
values), and so the ordering over terms may be different. It is left for future
work to investigate whether this idea always works.
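
The intuition can be illustrated by a small sketch. The concrete interpretations below are assumptions made for this illustration and are not claimed to be the deriver of Example 8: numerals are sent to pairs (n, −n), and the first coordinate of a subtraction reads the negated value from the second coordinate of its second argument, so only the monotone operations + and max are used.

```python
def zero():
    """Hypothetical interpretation of 0: the pair (0, 0)."""
    return (0, 0)


def succ(x):
    """Hypothetical interpretation of s: the first coordinate (sort Nat)
    counts up, the second coordinate (sort Neg) counts down."""
    return (x[0] + 1, x[1] - 1)


def minus_first(x, y):
    """First coordinate of 'x - y', built only from + and max.
    It uses y[1], which already stores the negated value of y."""
    return max(x[0] + y[1], 0)


def numeral(n):
    """Interpretation of s^n(0)."""
    value = zero()
    for _ in range(n):
        value = succ(value)
    return value
```

On numerals, `minus_first(numeral(a), numeral(b))` agrees with max(a − b, 0), yet the expression is weakly monotone in every coordinate it reads.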


6 Arctic Interpretations


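An arctic interpretation [19] $[\cdot]_{\mathcal{A}}$ is a matrix interpretation on the arctic
semiring; that is, every interpretation $[f]_{\mathcal{A}}(\vec{x}_1, \dots, \vec{x}_n)$ is of the form


$$C_1 \otimes \vec{x}_1 \oplus \dots \oplus C_n \otimes \vec{x}_n \oplus \vec{c}$$


where $\otimes$ and $\oplus$ denote the matrix multiplication and matrix addition in which
the scalar addition is replaced by the max operation, and the scalar
multiplication by addition; and entries of $C_i$ and $\vec{c}$ are arctic naturals ($\mathbb{N}_{-\infty} := \mathbb{N} \cup \{-\infty\}$)
or arctic integers ($\mathbb{Z}_{-\infty} := \mathbb{Z} \cup \{-\infty\}$). In addition, $\vec{c}$ must be absolute
positive: $(\vec{c})_1 \geq 0$, so that $(\mathbb{N} \times \mathbb{N}_{-\infty}^{m-1}, [\cdot]_{\mathcal{A}}, \geqq, \gg)$ or $(\mathbb{N} \times \mathbb{Z}_{-\infty}^{m-1}, [\cdot]_{\mathcal{A}}, \geqq, \gg)$ forms
a well-founded weakly monotone algebra.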

The above formulation deviates from the original [19] in two ways. First,
we do not introduce the special relation such that $-\infty \gg -\infty$. Koprowski and
Waldmann demanded this to ensure closure under general substitutions, but
such a comparison cannot occur as we only need to consider substitutions that
respect the carrier. Second, for arctic natural interpretations they relax
absolute positiveness to somewhere finiteness: $(\vec{c})_1 \neq -\infty$ or $(C_i)_{1,1} \neq -\infty$ for
some $i$. However, the two assumptions turn out to be equivalent.


Proposition 1. Every arctic natural interpretation of the above form is absolute
positive iff it is somewhere finite.


Proof. Clearly, absolute positiveness implies somewhere finiteness. For the other
direction, since $(\vec{c})_1 \neq -\infty$ trivially implies absolute positiveness, suppose that
$(\vec{c})_1 = -\infty$ and $(C_i)_{1,1} \neq -\infty$ for some $i$. We then know $(\vec{y})_1 \geq 0$, where
$\vec{y} := C_1 \otimes \vec{x}_1 \oplus \dots \oplus C_n \otimes \vec{x}_n$. Hence, by $\vec{d} := (0, (\vec{c})_2, \dots, (\vec{c})_m)$, we have
$[f]_{\mathcal{A}}(\vec{x}_1, \dots, \vec{x}_n) = \vec{y} \oplus \vec{d}$, and this representation is absolute positive.
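
The arctic operations and the replacement used in the proof can be sketched as follows. This is an illustration under the assumption that −∞ is modelled by Python's `float("-inf")`; the matrices and vectors in the usage note are made up for the example.

```python
NEG_INF = float("-inf")


def arctic_mat_vec(mat, vec):
    """Arctic matrix-vector product: entry i is max_j (mat[i][j] + vec[j])."""
    return [max(m + v for m, v in zip(row, vec)) for row in mat]


def arctic_add(u, v):
    """Arctic vector addition: coordinate-wise max."""
    return [max(a, b) for a, b in zip(u, v)]


def interpret(mats, const, args):
    """Evaluate C_1 (x) x_1 (+) ... (+) C_n (x) x_n (+) const in the arctic semiring."""
    result = list(const)
    for mat, arg in zip(mats, args):
        result = arctic_add(result, arctic_mat_vec(mat, arg))
    return result


# A somewhere finite interpretation: the constant starts with -inf,
# but the top-left matrix entry is finite.
C = [[0, NEG_INF], [1, 2]]
c = [NEG_INF, 3]
d = [0, 3]  # the absolute positive replacement from the proof
```

For arguments whose first coordinate is a natural number, `interpret([C], c, [x])` and `interpret([C], d, [x])` coincide, which is the content of the proof.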


One can easily obtain arctic interpretations via multi-dimensional derivers:
consider a sort ANat with $\mathsf{Nat} \subseteq \mathsf{ANat}$ and {Nat, ANat}-sorted signature Mimax-o,
extending Nimax with


$$-\infty : () \to \mathsf{ANat} \qquad + : (\mathsf{ANat}, \mathsf{ANat}) \to \mathsf{ANat}$$
$$\max : (\mathsf{Nat}, \mathsf{ANat}) \to \mathsf{Nat} \qquad \max : (\mathsf{ANat}, \mathsf{Nat}) \to \mathsf{Nat} \qquad \max : (\mathsf{ANat}, \mathsf{ANat}) \to \mathsf{ANat}$$


and extend the standard interpretation $[\cdot]$ accordingly. We omit the easy proof
of the following fact and the counterpart for arctic integer interpretations.


Proposition 2. Every absolute positive arctic natural interpretation $[\cdot]_{\mathcal{A}}$ is
represented as $d[\cdot]$ via an F/Nmax-w-deriver ((Nat, ANat, ..., ANat), d).


Notice that, in practice, this requires us to deal with $-\infty$ by ourselves since
there is no standard SMT theory [3] that supports arithmetic with $-\infty$.


7 Strict Monotonicity


Before the invention of dependency pairs [1], strictly monotone algebras were
necessary for proving termination by interpretation methods, and they constitute 
a sound and complete method for proving termination of TRSs. 


Definition 7. A strictly monotone F-algebra is a weakly monotone F-algebra
$(A, [\cdot], \geq, >)$ such that $(A, [\cdot])$ is monotone with respect to both $\geq$ and $>$.


Theorem 7 (cf. [36]). A TRS $\mathcal{R}$ is terminating if and only if there is a strictly
monotone well-founded F-algebra $(A, [\cdot], \geq, >)$ such that $\mathcal{R} \subseteq [>]$.


Moreover, strict monotonicity is a desirable property in the DP framework as it 
allows one to remove not only dependency pairs but also rewrite rules. 


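Theorem 8 ([12]). A DP problem $(\mathcal{P}, \mathcal{R})$ is finite if $(\mathcal{P} \setminus [>], \mathcal{R} \setminus [>])$ is, where
$(A, [\cdot], \geq, >)$ is a strictly monotone well-founded F-algebra such that $\mathcal{P} \cup \mathcal{R} \subseteq [\geq]$.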


We now state a criterion that ensures the strict monotonicity of
multi-dimensional interpretations obtained via derivers. Below we write $d_j$ to mean
the mapping defined by $d_j(f) := (d(f))_j$.


Theorem 9. Let $(\vec{\sigma}, d)$ be an m-dimensional F/G-deriver and $(A, [\cdot], \geq, >)$ a
weakly monotone G-algebra. Suppose that when $f$ has arity $n$ in F and
$i \in \{1, \dots, n\}$, $\alpha(x_{i,1}) > a$ implies $[d_1(f)]\alpha > [d_1(f)]\alpha(x_{i,1} \mapsto a)$ for any $\alpha : X \to A$
and $a \in A$. Then $(A^{\vec{\sigma}}, d[\cdot], \geqq, \gg)$ is a strictly monotone F-algebra.


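Proof. We only prove strict monotonicity as we already know weak monotonicity
by Lemma 5. So suppose that $f$ has arity $n$ in F, $\vec{a}_1, \dots, \vec{a}_i, \dots, \vec{a}_n, \vec{a}'_i \in A^{\vec{\sigma}}$ and
$\vec{a}_i \gg \vec{a}'_i$. For the first coordinate, define $\alpha$ by $\alpha(x_{k,j}) := (\vec{a}_k)_j$. Then, first using
the assumption, and then Lemma 3, we conclude


$$\begin{aligned}
d_1[f](\vec{a}_1, \dots, \vec{a}_i, \dots, \vec{a}_n) &= [d_1(f)]\alpha \\
&> [d_1(f)]\alpha(x_{i,1} \mapsto (\vec{a}'_i)_1) \\
&\geq [d_1(f)]\alpha(x_{i,1} \mapsto (\vec{a}'_i)_1, x_{i,2} \mapsto (\vec{a}'_i)_2, \dots, x_{i,m} \mapsto (\vec{a}'_i)_m) \\
&= d_1[f](\vec{a}_1, \dots, \vec{a}'_i, \dots, \vec{a}_n)
\end{aligned}$$


For the other coordinates, thanks to the “new” assumption $> \subseteq \geq$
we have $\vec{a}_i \geqq \vec{a}'_i$. Then the weak monotonicity ensures
$d[f](\vec{a}_1, \dots, \vec{a}_i, \dots, \vec{a}_n) \geqq d[f](\vec{a}_1, \dots, \vec{a}'_i, \dots, \vec{a}_n)$, from which we deduce for each
$j \in \{2, \dots, m\}$ that $d_j[f](\vec{a}_1, \dots, \vec{a}_i, \dots, \vec{a}_n) \geq d_j[f](\vec{a}_1, \dots, \vec{a}'_i, \dots, \vec{a}_n)$.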


Although the above result and proof do not look surprising, it is worth
noticing that the statement is false in the standard formulation allowing $> \not\subseteq \geq$
(as even in [8]).


Example 9. Consider the following apparently monotone matrix interpretation:


i( (0) = Go) (a) = (a) 


If one had $a_1 > b_1$ but $a_1 \not\geq b_1$, then


ai((2))- (2) E e (2) > (2): 


So $[f]$ would not be monotone with respect to $\gg$.


8 Implementation and Experiments 


Multi-dimensional interpretations are implemented in the termination prover
NaTT version 2 (see footnote 2) using a template-based approach.


Definition 8. An m-dimensional F/G-deriver template $(\vec{\sigma}, d)$ with S-sorted
set $W$ of template variables is defined as in Definition 6 but allowing
$d(f) \in T(G, W \cup X)^{\vec{\sigma}}$. Its instance according to a substitution $\theta : W \to T(G, \emptyset)$ is the
F/G-deriver $(\vec{\sigma}, d\theta)$, defined by $d\theta(f) := (d_1(f)\theta, \dots, d_m(f)\theta)$.


In the implementation, we fix G = Zmar UN and the base weakly monotone
G-algebra $(\mathbb{Z}, [\cdot], \geq, >)$. Given an m-dimensional deriver template $(\vec{\sigma}, d)$ with
$W$, our interest is now to find $\theta : W \to \mathbb{Z}$ such that $d\theta[s] \mathrel{\underset{(\gg)}{\geqq}} d\theta[t]$ for every
$(s, t) \in \mathcal{P} \cup \mathcal{R}$ for the DP problem $(\mathcal{P}, \mathcal{R})$ of concern, thanks to Theorem 6.
NaTT reduces this problem into an SMT problem and passes it to a backend
SMT solver. The page limit is not enough to detail the reduction; in short, the
constraint $d\theta[s] \mathrel{\underset{(\gg)}{\geqq}} d\theta[t]$ is reduced into a Boolean formula over atoms of form
$a * (v_1, i_1) * \dots * (v_n, i_n) \geq b * (v_1, i_1) * \dots * (v_n, i_n)$ where $a, b \in T(G, W)$,
and $(v_1, i_1), \dots, (v_n, i_n) \in (Var(s) \cup Var(t)) \times \{1, \dots, m\}$ are seen as variables.
Internally NaTT uses a distribution approach [30], whose soundness crucially
relies on the fact that the only rank of $*$ is (Nat, Nat) → Nat in the signature
G. Then each atom is further reduced to (1) $a = b$ if $(\vec{\sigma})_{i_j} = \mathsf{Int}$ for some
$j$, (2) $a \geq b$ if $|\{j \mid (\vec{\sigma})_{i_j} = \mathsf{Neg}\}|$ is even, and (3) $a \leq b$ otherwise. Due
to the last step, having coordinates of sort Int leads to a stronger constraint
when ordering terms. Finally, the resulting formula, containing only template
variables, is passed to the SMT solver Z3 4.8.10 and a satisfying solution
$\theta : W \to \mathbb{Z}$ is a desired substitution.
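
The last reduction step can be sketched as a sign analysis of the variable product. The function below is a hypothetical illustration (its name and string encoding are assumptions, and it is not taken from NaTT): it returns the relation that the coefficients a and b must satisfy so that a·v_1⋯v_n ≥ b·v_1⋯v_n holds for all values of the variables.

```python
def reduce_atom(sorts):
    """Return the relation required between coefficients a and b.

    `sorts` lists the sort ("Nat", "Neg" or "Int") of each variable
    occurring in the product shared by both sides of the atom.
    """
    if "Int" in sorts:
        # The product may take either sign, so only a == b is safe.
        return "=="
    negatives = sum(1 for sort in sorts if sort == "Neg")
    # An even number of non-positive factors gives a non-negative product.
    return ">=" if negatives % 2 == 0 else "<="
```

For example, a product of one Nat and one Neg variable is never positive, so the comparison between the coefficients flips to a ≤ b.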

To verify the practical significance of the method, we evaluated various
templates in a simple dependency pair setting. For a function symbol f of arity
n > 2, the k-th coordinate of template d(f) is chosen from

– sum: $w + \sum_{i=1}^{n} b * x_{i,k}$,
– max: $\max_{i=1}^{n} b * (w + x_{i,k})$,
– sum-sum: $w + \sum_{i=1}^{n} \sum_{j=1}^{m} b * x_{i,j}$,
– max-max: $\max_{i=1}^{n} \max_{j=1}^{m} b * (w + x_{i,j})$,
– sum-max: $\sum_{i=1}^{n} \max_{j=1}^{m} b * (w + x_{i,j})$,
– max-sum: $\max_{i=1}^{n} (w + \sum_{j=1}^{m} b * x_{i,j})$, and
– a heuristic choice between sum-sum and max-sum,

where b and w introduce fresh template variables, b ranges over {0,1} and the
sort of w is up to further choice. The sort of the first coordinate is turned to Nat
by applying max(·, 0) if necessary.


Table 1. Evaluation of 2-dimensional templates.

 #    Coordinate 1      Coordinate 2      YES   New   Time       Known as
 1    sum       Nat     -                 512   -     00:36:12   polynomial [1]
 2    sum       Int     -                 559   -     00:52:37   negative constant
 3    sum-sum   Nat     sum-sum   Nat     636   -     04:18:05   matrix
 4    sum-sum   Int     sum       Neg     602   10    04:00:05   new
 5    sum-sum   Int     sum-sum   Int     542   0     25:07:04   new
 6    sum-sum   Int     max       Neg     585   8     14:58:41   new
 7    max       Int     -                 560   -     00:58:58   see footnote 3
 8    max-max   Nat     max-max   Nat     552   3     12:33:43   arctic natural
 9    max-max   Int     max-max   Int     580   2     22:35:29   arctic integer [19] (footnote 4)
 10   max-max   Nat     sum       Nat     577   0     03:48:46   new
 11   max-max   Int     sum       Neg     584   2     06:53:34   new
 12   max-sum   Int     sum       Neg     592   4     06:59:22   new
 13   heuristic Int     sum       Neg     648   9     04:55:43   new


2 Available at https://www.trs.cm.is.nagoya-u.ac.jp/NaTT/
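
To make the shape of the templates concrete, the sketch below evaluates three of them for fixed values of the template variables. The data layout is an assumption of this illustration: `x[i][j]` is the j-th coordinate of the i-th argument, and `b` and `w` are indexed so that every occurrence in a template gets its own value.

```python
def sum_sum(w, b, x):
    """w + sum_i sum_j b[i][j] * x[i][j], with a single constant w."""
    return w + sum(
        b[i][j] * x[i][j] for i in range(len(x)) for j in range(len(x[i]))
    )


def max_sum(w, b, x):
    """max_i (w[i] + sum_j b[i][j] * x[i][j]), one constant per argument."""
    return max(
        w[i] + sum(b[i][j] * x[i][j] for j in range(len(x[i])))
        for i in range(len(x))
    )


def max_max(w, b, x):
    """max_i max_j b[i][j] * (w[i][j] + x[i][j]), one constant per entry."""
    return max(
        b[i][j] * (w[i][j] + x[i][j])
        for i in range(len(x))
        for j in range(len(x[i]))
    )


# Two arguments, two coordinates each; coefficients b range over {0, 1}.
ARGS = [(3, 1), (2, 5)]
COEFFS = [[1, 0], [1, 1]]
```

In a template-based search the entries of `b` and `w` would be unknowns handed to the solver; here they are fixed only to show what each template computes.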

Experiments are run on the StarExec environment [29], with a timeout of 300
seconds. The benchmarks are the 1507 TRSs from the TRS Standard category
of the termination problem database 11 [32]. Due to the huge search space, we
evaluate templates of dimensions up to 2. A part of the results is summarized
in Table 1. Full details of the experiments are made available at
trs.cm.is.nagoya-u.ac.jp/NaTT/multi/

In the table, each coordinate is represented by the template and the sort
of w. In terms of the number of successful termination proofs indicated in the
“YES” column, the classical matrix interpretations (row #3) are impressively
strong. Nevertheless, it is worth considering a negative coordinate (#4) as it
gives 10 termination proofs that the previous version of NaTT could not find,
indicated in the “New” column. In contrast, considering whole integers in the
second coordinate (#5) does not look promising as the runtime grows
significantly. Concerning “max”, we observe that its use in the second coordinate (#6)
degrades the performance. Using “max” in both coordinates à la arctic
interpretations (#8, #9) gives a few new termination proofs, but the impact on the
runtime is significant in the current implementation. The runtime improves by
replacing some occurrences of “max” by “sum” (#10-12), while the power does
not seem to be affected. In terms of the number of termination proofs, the heuristic
choice of “sum-sum” and “max-sum” in the first coordinate (#13) performed
the best among the evaluated templates.


Table 2. Experiments with combined strategies

Strategy            YES   New to NaTT   New to TermCOMP   Time
Old Strategy        861   0             0                 3:46:12
With #4             874   13            3                 4:14:09
With #13            871   10            1                 4:26:14
With #4 and #13     881   20            5                 4:49:50


3 This template is a subset of integer max-polynomials [9], although the fact that it
yields a reduction pair is new.

4 In our implementation, negative infinity is not supported. Instead, a similar effect is
emulated by zero coefficients.

From these experiments, we pick templates #4 and #13 to incorporate in the
NaTT default strategy. The final results are summarized in Table 2. Although the
runtime noticeably increases, adding both #4 and #13 gives 20 more examples
solved, and five of them (AProVE_09_Inductive/log and four in
Transformed_CSR_04/) were not solved by any tool in TermCOMP 2020.


9 Conclusion 


In this paper we introduced a deriver-based multi-dimensional interpretation 
method. The author expects that the result makes the relationships between 
existing interpretation methods cleaner, and eases the task of developing and 
maintaining termination tools. Moreover, it yields many previously unknown 
interpretation methods as instances, proving the termination of some standard 
benchmarks that state-of-the-art termination provers could not. 

Theoretical comparison with negative coefficients is left for future work, and
the use of $-\infty$ is not implemented yet. Also, since this work broadens the search
space, it is interesting to heuristically search for derivers rather than fixing some
templates. Derivers of higher dimensions also seem interesting to explore. Finally,
although the proposed method is implemented in the termination prover NaTT,
there is no guarantee that the implementation is correct. In order to certify
termination proofs that use multi-dimensional derivers, one must formalize the
proofs in this paper, extend the certifiable proof format [27], and implement a
verified function to validate such proofs.


Abstract. Logic-based approaches to AI have the advantage that their
behavior can in principle be explained to a user. If, for instance, a 
Description Logic reasoner derives a consequence that triggers some 
action of the overall system, then one can explain such an entailment by 
presenting a proof of the consequence in an appropriate calculus. How 
comprehensible such a proof is depends not only on the employed calculus, 
but also on the properties of the particular proof, such as its overall size, 
its depth, the complexity of the employed sentences and proof steps, etc. 
For this reason, we want to determine the complexity of generating proofs 
that are below a certain threshold w.r.t. a given measure of proof quality. 
Rather than investigating this problem for a fixed proof calculus and a 
fixed measure, we aim for general results that hold for wide classes of 
calculi and measures. In previous work, we first restricted the attention 
to a setting where proof size is used to measure the quality of a proof. 
We then extended the approach to a more general setting, but important 
measures such as proof depth were not covered. In the present paper, we 
provide results for a class of measures called recursive, which yields lower 
complexities and also encompasses proof depth. In addition, we close 
some gaps left open in our previous work, thus providing a comprehensive 
picture of the complexity landscape. 


1 Introduction 


Explainability has developed into a major issue in Artificial Intelligence, particu- 
larly in the context of sub-symbolic approaches based on Machine Learning [6]. 
In contrast, results produced by symbolic approaches based on logical reasoning 
are “explainable by design” since a derived consequence can be formally justified 
by showing a proof for it. In practice, things are not that easy since proofs may 
be very long, and even single proof steps or stated sentences may be hard to com- 
prehend for a user that is not an expert in logic. For this reason, there has been 
considerable work in the Automated Deduction and Logic in AI communities on 
how to produce “good” proofs for certain purposes, both for full first-order logic
and for decidable logics such as Description Logics (DLs) [9]. We mention here
only a few approaches, and refer the reader to the introduction of our previous 
work [2] for a more detailed review. 


© The Author(s) 2021
A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 291-308, 2021.
https://doi.org/10.1007/978-3-030-79876-5_17

First, there is work that transforms proofs that are produced by an automated
reasoning system into ones in a calculus that is deemed to be more appropriate 
for human consumption [11,22,23]. Second, abstraction techniques are used to 
reduce the size of proofs by introducing definitions, lemmas, and more abstract 
deduction rules [16,17]. Justification-based explanations for DLs [10, 14,28] can 
be seen as a radical abstraction technique where the abstracted proof consists of 
a single proof step, from a minimal set of stated sentences that implies a certain 
consequence directly to this consequence. Finally, instead of presenting proofs in 
a formal, logical syntax, one can also try to increase readability by translating 
them into natural language text [12, 25-27] or visualizing them [5]. 

The purpose of this work is of a more (complexity) theoretic nature. We want 
to investigate how hard it is to find good proofs, where the quality of a proof is 
described by a measure m that assigns non-negative rational numbers to proofs. 
More precisely, as usual we investigate the complexity of the corresponding
decision problem, i.e., the problem of deciding whether there is a proof $\mathcal{P}$ with
$m(\mathcal{P}) \leq q$ for a given rational number $q$. In order to abstract from specific logics
and proof calculi, we develop a general framework in which proofs are represented 
as labeled, directed hypergraphs, whose hyperedges correspond to single sound 
derivation steps. To separate the complexity of generating good proofs from the 
complexity of reasoning in the underlying logic, we introduce the notion of a 
deriver, which generates a so-called derivation structure. This structure consists 
of possible proof steps, from which all proofs of the given consequence can be 
constructed. Basically, such a derivation structure can be seen as consisting of all 
relevant instantiations of the rules of a calculus that can be used to derive the 
consequence. We restrict the attention to decidable logics and consider derivers 
that produce derivation structures of polynomial or exponential size. Examples 
of such derivers are consequence-based reasoners for the DLs EL [7,21] and
ELI [9,18], respectively. In our complexity results, the derivation structure is
assumed to be already computed by the deriver,¹ i.e., the complexity of this
step is not assumed to be part of the complexity of computing good proofs. 
Our complexity results investigate the problem along the following orthogonal 
dimensions: we distinguish between (i) polynomial and exponential derivers; and 
(ii) whether the threshold value q is encoded in unary or binary. The obtained 
complexity upper bounds hold for all instances of a considered setting, whereas 
the lower bounds mean that there is an instance (usually based on EL or ELI)
for which this lower bound can be proved. 

In our first work in this direction [2], we focused our attention on size as 
the measure of proof quality. We could show that the above decision problem 
is NP-complete even for polynomial derivers and unary coding of numbers. 
For exponential derivers, the complexity depends on the coding of numbers: 
NP-complete (NExpTime-complete) for unary (binary) coding. For the related 
measure tree size (which assumes that the proof hypergraphs are tree-shaped, 
i.e. cannot reuse already derived consequences), the complexity turned out to 


1 The highly efficient reasoner ELK [21] for (an extension of) EL actually produces a
derivation structure, and thus is a deriver in our sense.


Table 1. Overview over existing and new complexity results for deciding the existence
of good proofs, w.r.t. polynomial/exponential derivers and unary/binary encoding of
the bound q (known results in gray).

                                 polynomial                    exponential
Measure                          unary        binary           unary              binary
Size                             NP [2]       NP [2]           NEXPTIME [2]       NEXPTIME [2]
Monotone recursive Φ-measures    ≤ P          ≤ P [Th.12]      ≤ EXPTIME          ≤ EXPTIME [Th.12]
Tree size                        P [2]        P                NP [2]             PSPACE [Th.17,18]
Depth                            P [Th.14]    P                PSPACE [Th.16]     EXPTIME [Th.14]
Logarithmic depth                P [Cor.15]   P                EXPTIME [Cor.15]   EXPTIME

be considerably lower, due to the fact that a Dijkstra-like greedy algorithm can
be applied. In [3], we generalized the results by introducing a class of measures
called $\Psi$-measures, which contains both size and tree size and for which the same
complexity upper bounds as for size could be shown for polynomial derivers.
We also lifted the better upper bounds for tree size (for polynomial derivers)
to local $\Psi$-measures, a natural class of proof measures. In this paper, we extend
this line of research by providing a more general notion of measures, monotone
recursive $\Phi$-measures, which now also allow measuring the depth of a proof.
We think that depth is an important measure since it measures how much of
the proof tree a (human or automated) proof checker needs to keep in memory
at the same time. We analyze these measures not only for polynomial derivers,
but this time also consider exponential derivers, thus giving insights on how
our complexity results transfer to more expressive logics. In addition to upper
bounds for the general class of monotone recursive $\Phi$-measures, we show improved
bounds for the specific measures considering depth and tree size, in the latter
case improving results from [2]. Overall, we thus obtain a comprehensive picture
of the complexity landscape for the problem of finding good proofs for DL and
other entailments (see Table 1).
An extended version of this paper with detailed proofs can be found at [4]. 


2 Preliminaries 


Most of our theoretical discussion applies to arbitrary logics $\mathcal{L} = (\mathcal{S}_\mathcal{L}, \models_\mathcal{L})$ that
consist of a set $\mathcal{S}_\mathcal{L}$ of $\mathcal{L}$-sentences and a consequence relation $\models_\mathcal{L} \subseteq P(\mathcal{S}_\mathcal{L}) \times \mathcal{S}_\mathcal{L}$
between $\mathcal{L}$-theories, i.e. subsets of $\mathcal{L}$-sentences, and single $\mathcal{L}$-sentences. We assume
that $\models_\mathcal{L}$ has a semantic definition, i.e. for some definition of “model”, $\mathcal{T} \models_\mathcal{L} \eta$
holds iff every model of all elements in $\mathcal{T}$ is also a model of $\eta$. We also assume
that the size $|\eta|$ of an $\mathcal{L}$-sentence $\eta$ is defined in some way, e.g. by the number
of symbols in $\eta$. Since $\mathcal{L}$ is usually fixed, we drop the prefix “$\mathcal{L}$-” from now on.
For example, $\mathcal{L}$ could be first-order logic. However, we are mainly interested in
proofs for DLs, which can be seen as decidable fragments of first-order logic [9].
In particular, we use specific DLs to show our hardness results.


Fig. 1. The inference rules for EL used in ELK [21].


The syntax of DLs is based on disjoint, countably infinite sets $N_C$ and $N_R$ of
concept names $A, B, \dots$ and role names $r, s, \dots$, respectively. Sentences of the
DL EL, called general concept inclusions (GCIs), are of the form $C \sqsubseteq D$, where
$C$ and $D$ are EL-concepts, which are built from concept names by applying the
constructors $\top$ (top), $C \sqcap D$ (conjunction), and $\exists r.C$ (existential restriction for a
role name $r$). The DL ELI extends EL by the role constructor $r^-$ (inverse role).
In DLs, finite theories are called TBoxes or ontologies.

The semantics of DLs is based on first-order interpretations; for details,
see [9]. In Figure 1, we depict a simplified version of the inference rules for EL
from [21]. For example, $\{A \sqsubseteq \exists r.B, B \sqsubseteq C, \exists r.C \sqsubseteq D\} \models A \sqsubseteq D$ is a valid
inference in EL. Deciding consequences in EL is P-complete [7], and in ELI it is
EXPTIME-complete [8].


2.1 Proofs 


We formalize proofs as (labeled, directed) hypergraphs (see Figures 2, 3), which
are tuples $(V, E, \ell)$ consisting of a finite set $V$ of vertices, a finite set $E$ of
(hyper)edges of the form $(S, d)$ with $S \subseteq V$ and $d \in V$, and a vertex labeling
function $\ell : V \to \mathcal{S}_\mathcal{L}$. Full definitions of such hypergraphs, as well as related
notions such as trees, unravelings, homomorphisms, and cycles, can be found in the
extended version [4]. For example, there is a homomorphism from Figure 3 to
Figure 2, but not vice versa, and Figure 3 is the tree unraveling of Figure 2.


Fig. 2. An acyclic hypergraph/proof

Fig. 3. A tree hypergraph/proof


The following definition formalizes basic requirements for hyperedges to be 
considered valid inference steps from a given finite theory. 


Definition 1 (Derivation Structure). A derivation structure $\mathcal{D} = (V, E, \ell)$
over a finite theory $\mathcal{T}$ is a hypergraph that is

– grounded, i.e. every leaf $v$ in $\mathcal{D}$ is labeled by $\ell(v) \in \mathcal{T}$; and
– sound, i.e. for every $(S, d) \in E$, the entailment $\{\ell(s) \mid s \in S\} \models \ell(d)$ holds.


We define proofs as special derivation structures that derive a conclusion. 


Definition 2 (Proof). Given a conclusion $\eta$ and a finite theory $\mathcal{T}$, a proof for
$\mathcal{T} \models \eta$ is a derivation structure $\mathcal{P} = (V, E, \ell)$ over $\mathcal{T}$ such that

– $\mathcal{P}$ contains exactly one sink $v_\eta \in V$, which is labeled by $\eta$,
– $\mathcal{P}$ is acyclic, and
– every vertex has at most one incoming edge, i.e. there is no vertex $w \in V$ s.t.
  there are $(S_1, w), (S_2, w) \in E$ with $S_1 \neq S_2$.


A tree proof is a proof that is a tree. A subproof S of a hypergraph H is a 
subgraph of H that is a proof s.t. the leaves of S are a subset of the leaves of H. 


The hypergraphs in Figures 2 and 3 can be seen as proofs in the sense of
Definition 2, where the sentences of the theory are marked with a thick border.
Both proofs use the same inference steps, but have different numbers of vertices.
They both prove $A \sqsubseteq B \sqcap \exists r.A$ from $\mathcal{T} = \{A \sqsubseteq B, B \sqsubseteq \exists r.A\}$. The second proof
is a tree and the first one a hypergraph without label repetition.
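
The structural conditions of Definition 2 can be checked mechanically. The sketch below is an illustration under simplifying assumptions: sentences are plain strings, groundedness and soundness are not checked (they would need a reasoner), and the two example hypergraphs are guesses at the shape of Figures 2 and 3, which are not reproduced in the text.

```python
from dataclasses import dataclass


@dataclass
class Hypergraph:
    labels: dict   # vertex -> sentence (string)
    edges: set     # set of (frozenset of premise vertices, conclusion vertex)


def sinks(h):
    """Vertices that are not a premise of any edge."""
    used = {p for premises, _ in h.edges for p in premises}
    return [v for v in h.labels if v not in used]


def is_acyclic(h):
    """Depth-first search along incoming edges, looking for a back edge."""
    incoming = {}
    for premises, conclusion in h.edges:
        incoming.setdefault(conclusion, []).append(premises)
    state = {}

    def visit(v):
        if state.get(v) == "done":
            return True
        if state.get(v) == "open":
            return False
        state[v] = "open"
        ok = all(visit(p) for premises in incoming.get(v, []) for p in premises)
        state[v] = "done"
        return ok

    return all(visit(v) for v in h.labels)


def has_proof_shape(h, conclusion):
    """One sink labeled by the conclusion, acyclic, one incoming edge at most."""
    targets = [c for _, c in h.edges]
    at_most_one_incoming = len(targets) == len(set(targets))
    s = sinks(h)
    return (
        at_most_one_incoming
        and is_acyclic(h)
        and len(s) == 1
        and h.labels[s[0]] == conclusion
    )


GOAL = "A ⊑ B ⊓ ∃r.A"
# Hypothetical rendering of the acyclic proof: "A ⊑ B" is reused.
DAG_PROOF = Hypergraph(
    labels={1: "A ⊑ B", 2: "B ⊑ ∃r.A", 3: "A ⊑ ∃r.A", 4: GOAL},
    edges={(frozenset({1, 2}), 3), (frozenset({1, 3}), 4)},
)
# Hypothetical rendering of the tree proof: "A ⊑ B" occurs twice.
TREE_PROOF = Hypergraph(
    labels={1: "A ⊑ B", 2: "B ⊑ ∃r.A", 3: "A ⊑ ∃r.A", 4: "A ⊑ B", 5: GOAL},
    edges={(frozenset({1, 2}), 3), (frozenset({4, 3}), 5)},
)
```

Both structures pass the check; they differ only in whether the leaf labeled A ⊑ B is shared, which is what makes the vertex counts differ.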


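Lemma 3. Let $\mathcal{P} = (V, E, \ell)$ be a proof for $\mathcal{T} \models \eta$. Then

1. all paths in $\mathcal{P}$ are finite and all longest paths in $\mathcal{P}$ have $v_\eta$ as the target; and
2. $\mathcal{T} \models \eta$.


Given a proof $\mathcal{P} = (V, E, \ell)$ and a vertex $v \in V$, the subproof of $\mathcal{P}$ with sink $v$
is the largest subgraph $\mathcal{P}_v = (V_v, E_v, \ell_v)$ of $\mathcal{P}$ where $V_v$ contains all vertices in $V$
that have a path to $v$ in $\mathcal{P}$.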


2.2 Derivers 


In practice, proofs and derivation structures are constructed by a reasoning
system, and in theoretical investigations, it is common to define proofs by means
of a calculus. To abstract from these details, we use the concept of a deriver as
in [2], which is a function that, given a theory $\mathcal{T}$ and a conclusion $\eta$, produces
the corresponding derivation structure in which we can look for an optimal proof.
However, in practice, it would be inefficient and unnecessary to compute the
entire derivation structure beforehand when looking for an optimal proof. Instead,
we allow access to elements in a derivation structure using an oracle, which we
can ask whether given inferences are a part of the current derivation structure.
Similar functionality exists for example for the DL reasoner ELK [19], and may
correspond to checking whether the inference is an instance of a rule in the
calculus. Since reasoners may not be complete for proving arbitrary sentences
of $\mathcal{L}$, we restrict the conclusion $\eta$ to a subset $C_\mathcal{L} \subseteq \mathcal{S}_\mathcal{L}$ of supported consequences.


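Definition 4 (Deriver). A deriver $\mathfrak{D}$ is given by a set $C_\mathcal{L} \subseteq \mathcal{S}_\mathcal{L}$ and a function
that assigns derivation structures to pairs $(\mathcal{T}, \eta)$ of finite theories $\mathcal{T} \subseteq \mathcal{S}_\mathcal{L}$ and
sentences $\eta \in C_\mathcal{L}$, such that $\mathcal{T} \models \eta$ iff $\mathfrak{D}(\mathcal{T}, \eta)$ contains a proof for $\mathcal{T} \models \eta$. A
proof $\mathcal{P}$ for $\mathcal{T} \models \eta$ is called admissible w.r.t. $\mathfrak{D}(\mathcal{T}, \eta)$ if there is a homomorphism
$h : \mathcal{P} \to \mathfrak{D}(\mathcal{T}, \eta)$. We call $\mathfrak{D}$ a polynomial deriver if there exists a polynomial $p(x)$
such that the size of $\mathfrak{D}(\mathcal{T}, \eta)$ is bounded by $p(|\mathcal{T}| + |\eta|)$. Exponential derivers are
defined similarly by the restriction $|\mathfrak{D}(\mathcal{T}, \eta)| \leq 2^{p(|\mathcal{T}| + |\eta|)}$.


$$\mathsf{CR1}\ \dfrac{}{K \sqsubseteq A}\ \text{if } A \in K \text{ and } K \text{ appears in } \mathcal{T}' \qquad \mathsf{CR2}\ \dfrac{M \sqsubseteq A \text{ for all } A \in K, \quad K \sqsubseteq C}{M \sqsubseteq C}\ \text{if } M \text{ appears in } \mathcal{T}'$$
$$\mathsf{CR3}\ \dfrac{M \sqsubseteq \exists r.L \quad L \sqsubseteq \forall r^-.A}{M \sqsubseteq A} \qquad \mathsf{CR4}\ \dfrac{L \sqsubseteq \exists r.M \quad L \sqsubseteq \forall r.A}{L \sqsubseteq \exists r.(M \sqcap A)}$$

Fig. 4. The inference rules for ELI [9]. Given a finite theory $\mathcal{T}$ in a certain normal form,
the rules produce a saturated theory $\mathcal{T}'$. Here, $K$, $L$, $M$ are conjunctions of concept
names, $A$ is a concept name, $C$ is an ELI concept of the form $A$, $\exists r.M$, or $\forall r.A$, and $r$
is a role name or the inverse of a role name. In this calculus conjunctions are implicitly
viewed as sets, i.e. the order and multiplicity of conjuncts is ignored.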


ELK is an example of a polynomial deriver, that is, for a given EL theory $\mathcal{T}$
and EL sentence $\eta$, ELK($\mathcal{T}, \eta$) contains all allowed instances of the rules shown
in Figure 1. As an example for an exponential deriver we use ELI, which uses
the rules from Figure 4 and is complete for ELI theories and conclusions of the
form $A \sqsubseteq B$, $A, B \in N_C$. The oracle access for a deriver $\mathfrak{D}$ works as follows. Let
$\mathcal{D} = (V, E, \ell) := \mathfrak{D}(\mathcal{T}, \eta)$ and $V = \{v_1, \dots, v_m\}$. $\mathcal{D}$ is accessed using the following
two functions, where $i, i_1, \dots, i_l$ are indices of vertices and $\alpha$ is a sentence:

– the edge query for $(i_1, \dots, i_l, i)$ returns true if $(\{v_{i_1}, \dots, v_{i_l}\}, v_i) \in E$, and
  false otherwise;
– the label query for $(i, \alpha)$ returns true if $\ell(v_i) = \alpha$, and
  false otherwise.
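
A minimal sketch of such an oracle interface is given below, assuming the derivation structure is available as an explicit list of labels and edges. The class and method names (`DerivationOracle`, `is_edge`, `has_label`) are hypothetical; a real deriver would answer the same queries without materializing the whole structure.

```python
class DerivationOracle:
    """Oracle view of a derivation structure with vertices v_1, ..., v_m."""

    def __init__(self, labels, edges):
        # labels[i - 1] is the label of vertex v_i (indices start at 1).
        self._labels = list(labels)
        # Each edge is (premise indices, conclusion index).
        self._edges = {(frozenset(p), c) for p, c in edges}

    def is_edge(self, premises, conclusion):
        """True iff ({v_i1, ..., v_il}, v_i) is a hyperedge."""
        return (frozenset(premises), conclusion) in self._edges

    def has_label(self, index, sentence):
        """True iff vertex v_index exists and is labeled by the sentence."""
        return 1 <= index <= len(self._labels) and self._labels[index - 1] == sentence


ORACLE = DerivationOracle(
    labels=["A ⊑ B", "B ⊑ C", "A ⊑ C"],
    edges=[((1, 2), 3)],
)
```

Premises are stored as sets, so the order in which the indices are passed to `is_edge` does not matter.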


In this paper, we focus on polynomial and exponential derivers, for which we
further make the following technical assumptions: 1) $\mathfrak{D}(\mathcal{T}, \eta)$ does not contain
two vertices with the same label; 2) the number of premises in an inference is
polynomially bounded by $|\mathcal{T}|$ and $|\eta|$; and 3) the size of each label is polynomially
bounded by $|\mathcal{T}|$ and $|\eta|$. While 1) is without loss of generality, 2) and 3) are not.
If a deriver does not satisfy 2), we may be able to fix this by splitting inference
steps. Assumption 3) would not work for derivers with higher complexity, but
is required in our setting to avoid trivial complexity results for exponential
derivers. We furthermore assume that for polynomial and exponential derivers,
the polynomial $p$ from Definition 4 bounding the size of derivation structures is
known.


3 Measuring Proofs


To formally study quality measures for proofs, we developed the following defini- 
tion, which will be instantiated with concrete measures later. Our goal is to find 
proofs that minimize these measures, i.e. lower numbers are better. 


Definition 5 ($\Psi$-Measure). A (quality) measure is a function $\mathfrak{m}\colon \mathcal{P}_{\mathcal{L}} \to \mathbb{Q}_{\ge 0}$,
where $\mathcal{P}_{\mathcal{L}}$ is the set of all proofs over $\mathcal{L}$ and $\mathbb{Q}_{\ge 0}$ is the set of non-negative rational
numbers. We call $\mathfrak{m}$ a $\Psi$-measure if, for every $P \in \mathcal{P}_{\mathcal{L}}$, the following hold.


[P] $\mathfrak{m}(P)$ is computable in polynomial time in the size of $P$.
[HI] Let $h\colon P \to H$ be any homomorphism, and $P'$ be any subproof of the
homomorphic image $h(P)$ that is minimal (w.r.t. $\mathfrak{m}$) among all such
subproofs having the same sink. Then $\mathfrak{m}(P') \le \mathfrak{m}(P)$.


Intuitively, a $\Psi$-measure $\mathfrak{m}$ does not increase when the proof gets smaller, either
when parts of the proof are removed (to obtain a subproof) or when parts
are merged (in a homomorphic image). For example, $\mathfrak{m}_{\mathsf{size}}((V, E, \ell)) := |V|$ is
a $\Psi$-measure, called the size of a proof, and we have already investigated the
complexity of the following decision problem for $\mathfrak{m}_{\mathsf{size}}$ in [2].


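Definition 6 (Optimal Proof). Let $\mathfrak{D}$ be a deriver and $\mathfrak{m}$ be a measure. Given
a finite theory $\mathcal{T}$ and a sentence $\eta \in \mathcal{S}_{\mathcal{L}}$ s.t. $\mathcal{T} \models \eta$, an admissible proof $P$ w.r.t.
$\mathfrak{D}(\mathcal{T},\eta)$ is called optimal w.r.t. $\mathfrak{m}$ if $\mathfrak{m}(P)$ is minimal among all such proofs. The
associated decision problem, denoted $\mathsf{OP}(\mathfrak{D},\mathfrak{m})$, is to decide, given $\mathcal{T}$ and $\eta$ as
above and $q \in \mathbb{Q}_{\ge 0}$, whether there is an admissible proof $P$ w.r.t. $\mathfrak{D}(\mathcal{T},\eta)$ with
$\mathfrak{m}(P) \le q$.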


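For our complexity analysis, we distinguish the encoding of $q$ with a subscript
(unary/binary), e.g. $\mathsf{OP}_{\mathsf{unary}}(\mathfrak{D},\mathfrak{m})$.

We first show that if $P$ is optimal w.r.t. a $\Psi$-measure $\mathfrak{m}$ and $\mathfrak{D}(\mathcal{T},\eta)$, then the
homomorphic image of $P$ in $\mathfrak{D}(\mathcal{T},\eta)$ is also a proof. Thus, to decide $\mathsf{OP}(\mathfrak{D},\mathfrak{m})$
we can restrict our search to proofs that are subgraphs of $\mathfrak{D}(\mathcal{T},\eta)$.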


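Lemma 7. For any deriver $\mathfrak{D}$ and $\Psi$-measure $\mathfrak{m}$, if there is an admissible proof
$P$ w.r.t. $\mathfrak{D}(\mathcal{T},\eta)$ with $\mathfrak{m}(P) \le q$ for some $q \in \mathbb{Q}_{\ge 0}$, then there exists a subproof
$Q$ of $\mathfrak{D}(\mathcal{T},\eta)$ for $\mathcal{T} \models \eta$ with $\mathfrak{m}(Q) \le q$.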


In particular, this shows that an optimal proof always exists. 


Corollary 8. For any deriver $\mathfrak{D}$ and $\Psi$-measure $\mathfrak{m}$, if $\mathcal{T} \models \eta$, then there is an
optimal proof for $\mathcal{T} \models \eta$ w.r.t. $\mathfrak{D}$ and $\mathfrak{m}$.


Proof. By Definition 4, the derivation structure $\mathfrak{D}(\mathcal{T},\eta)$ contains at least one
proof for $\mathcal{T} \models \eta$. Since $\mathfrak{D}(\mathcal{T},\eta)$ is finite, there are finitely many proofs for $\mathcal{T} \models \eta$
contained in $\mathfrak{D}(\mathcal{T},\eta)$. The finite set of all $\mathfrak{m}$-weights of these proofs always has
a minimum. Finally, if there were an admissible proof weighing less than this
minimum, it would contradict Lemma 7.

3.1 Monotone Recursive Measures


Since the complexity of $\mathsf{OP}(\mathfrak{D},\mathfrak{m})$ for $\Psi$-measures in general is quite high [2], in
this paper we focus on a subclass of measures that can be evaluated recursively.


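Definition 9. A $\Psi$-measure $\mathfrak{m}$ is recursive if there exist


- a leaf function $\mathsf{leaf}_{\mathfrak{m}}\colon \mathcal{S}_{\mathcal{L}} \to \mathbb{Q}_{\ge 0}$ and

- a partial edge function $\mathsf{edge}_{\mathfrak{m}}$, which maps (i) the labels $(S,\alpha)$ of a hyperedge
and (ii) a finite multiset $Q$ of already computed intermediate weights in $\mathbb{Q}_{\ge 0}$
to a combined weight $\mathsf{edge}_{\mathfrak{m}}((S,\alpha), Q)$


such that, for any proof $P = (V, E, \ell)$ with sink $v$, we have


$$\mathfrak{m}(P) = \begin{cases} \mathsf{leaf}_{\mathfrak{m}}(\ell(v)) & \text{if } V = \{v\}, \\ \mathsf{edge}_{\mathfrak{m}}\bigl(\ell(S,v), \{\mathfrak{m}(P_w) \mid w \in S\}\bigr) & \text{if } (S,v) \in E. \end{cases}$$


Such a measure is monotone if, for any multiset $Q$, whenever $q \in Q$ and
$Q' = (Q \setminus \{q\}) \cup \{q'\}$ with $q' \le q$ and both $\mathsf{edge}_{\mathfrak{m}}((S,\alpha), Q')$ and $\mathsf{edge}_{\mathfrak{m}}((S,\alpha), Q)$
are defined, then $\mathsf{edge}_{\mathfrak{m}}((S,\alpha), Q') \le \mathsf{edge}_{\mathfrak{m}}((S,\alpha), Q)$.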


Intuitively, a recursive measure $\mathfrak{m}$ can be computed in a bottom-up fashion
starting with the weights of the leaves given by $\mathsf{leaf}_{\mathfrak{m}}$. The function $\mathsf{edge}_{\mathfrak{m}}$ is
used to recursively combine the weights of the direct subproofs into a weight
for the full proof. This function is well-defined since in a proof every vertex
has at most one incoming edge. We require $\mathsf{edge}_{\mathfrak{m}}$ to be defined only for inputs
$((S,\alpha), Q)$ that actually correspond to a valid proof in $\mathcal{L}$, i.e. where $S \models_{\mathcal{L}} \alpha$ and
$Q$ consists of the weights of some proofs for the sentences in $S$. For example, if $\mathfrak{m}$
always yields natural numbers, we obviously do not need $\mathsf{edge}_{\mathfrak{m}}$ to be defined for
multisets containing fractional numbers.

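In this paper, we are particularly interested in the following monotone recursive
$\Psi$-measures.


- The depth $\mathfrak{m}_{\mathsf{depth}}$ of a proof is defined by
$\mathsf{leaf}_{\mathfrak{m}_{\mathsf{depth}}}(\alpha) := 0$ and $\mathsf{edge}_{\mathfrak{m}_{\mathsf{depth}}}((S,\alpha), Q) := 1 + \max Q$.


- The tree size $\mathfrak{m}_{\mathsf{tree}}$ is given by
$\mathsf{leaf}_{\mathfrak{m}_{\mathsf{tree}}}(\alpha) := 1$ and $\mathsf{edge}_{\mathfrak{m}_{\mathsf{tree}}}((S,\alpha), Q) := 1 + \sum Q$.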


What distinguishes tree size from size is that vertices are counted multiple
times if they are used in several subproofs. The name tree size is inspired by
the fact that it can be interpreted as the size of the tree unraveling of a given
proof (cf. Figures 2 and 3). In fact, we show in the extended version [4] that
all recursive $\Psi$-measures are invariant under unraveling. This indicates that tree
size, depth and other monotone recursive $\Psi$-measures are especially well-suited
for cases where proofs are presented to users in the form of trees. This is for
example the case for the proof plugin for Protégé [20].
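
The difference between size, tree size and depth can be checked on a small example. The sketch below is an illustration only: it assumes that a proof is stored as a dictionary mapping each non-leaf conclusion to the premises of its unique incoming hyperedge, that the edge function ignores the labels $(S,\alpha)$, and that no hyperedge has an empty premise set. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# conclusion -> premises of its unique incoming hyperedge; leaves do not occur as keys
Proof = Dict[str, Tuple[str, ...]]


def recursive_measure(
    proof: Proof,
    sink: str,
    leaf: Callable[[str], int],
    edge: Callable[[List[int]], int],
) -> int:
    """Evaluate a recursive measure bottom-up from leaf weights."""
    if sink not in proof:
        return leaf(sink)
    return edge([recursive_measure(proof, w, leaf, edge) for w in proof[sink]])


def depth(proof: Proof, sink: str) -> int:
    return recursive_measure(proof, sink, lambda _: 0, lambda q: 1 + max(q))


def tree_size(proof: Proof, sink: str) -> int:
    return recursive_measure(proof, sink, lambda _: 1, lambda q: 1 + sum(q))


def size(proof: Proof, sink: str) -> int:
    """Number of distinct vertices; this one is not computed recursively."""
    seen, stack = set(), [sink]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(proof.get(v, ()))
    return len(seen)


# The vertex "c" is shared by the subproofs for "x" and "y".
shared = {"goal": ("x", "y"), "x": ("c",), "y": ("c",), "c": ("a", "b")}
```

On `shared`, the size is 6 while the tree size is 9, because the subproof for `c` (with leaves `a` and `b`) is counted once below `x` and once below `y`, exactly as in the tree unraveling.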


Lemma 10. Depth and tree size are monotone recursive $\Psi$-measures.

Algorithm 1: A Dijkstra-like algorithm
Input: A derivation structure $\mathfrak{D}(\mathcal{T},\eta) = (V, E, \ell)$, a monotone recursive
       $\Psi$-measure $\mathfrak{m}$
Output: An optimal proof of $\mathcal{T} \models \eta$ w.r.t. $\mathfrak{D}(\mathcal{T},\eta)$ and $\mathfrak{m}$
 1 $Q := \emptyset$
 2 foreach $e \in E$ do $k(e) := 0$
 3 foreach $v \in V$ do
 4     if $\ell(v) \in \mathcal{T}$ then
 5         $P(v) := (\{v\}, \emptyset, \ell|_{\{v\}})$; $Q := Q \cup \{v\}$    // $\ell(v)$ is in the theory
 6     else if $(\emptyset, v) \in E$ then
 7         $P(v) := (\{v\}, \{(\emptyset, v)\}, \ell|_{\{v\}})$; $Q := Q \cup \{v\}$    // $\ell(v)$ is a tautology
 8     else
 9         $P(v) :=$ undefined
10 while $Q \neq \emptyset$ do
11     choose $v \in Q$ with minimal $\mathfrak{m}(P(v))$    // $P(v)$ is optimal for $\ell(v)$
12     $Q := Q \setminus \{v\}$
13     foreach $e = (S, d) \in E$ with $v \in S$ do
14         $k(e) := k(e) + 1$
15         if $k(e) = |S|$ then    // all source vertices have been reached
16             $P := (S \cup \{d\}, \{e\}, \ell|_{S \cup \{d\}}) \cup \bigcup_{s \in S} P(s)$    // construct new proof
17             if $P$ is acyclic then
18                 if $P(d)$ is undefined or $\mathfrak{m}(P(d)) > \mathfrak{m}(P)$ then
19                     $P(d) := P$; $Q := Q \cup \{d\}$    // $P$ is better for $\ell(d)$
20 return $P(v_\eta)$, where $\ell(v_\eta) = \eta$


4 Complexity Results 


We investigate the decision problem OP for monotone recursive $\Psi$-measures. We
first show upper bounds for the general case, and then consider measures for depth 
and tree size, for which we obtain even lower bounds. An artificial modification 
of the depth measure gives a lower bound matching the general upper bound 
even if unary encoding is used for the threshold q. 


4.1 The General Case 


Algorithm 1 describes a Dijkstra-like approach that is inspired by the algorithm
in [13] for finding minimal hyperpaths w.r.t. so-called additive weighting functions,
which represent a subclass of monotone recursive $\Psi$-measures. The algorithm
progressively discovers proofs $P(v)$ for $\ell(v)$ that are contained in $\mathfrak{D}(\mathcal{T},\eta)$. If it
reaches a new vertex $v$ in this process, this vertex is added to the set $Q$. In each
step, a vertex with minimal weight $\mathfrak{m}(P(v))$ is chosen and removed from $Q$. For
each hyperedge $e = (S,d) \in E$, a counter $k(e)$ is maintained that is increased
whenever a vertex $v \in S$ is chosen. Once this counter reaches $|S|$, we know that
all source vertices of $e$ have been processed. The algorithm then constructs a
new proof $P$ for $\ell(d)$ by joining the proofs for the source vertices using the
current hyperedge $e$. This proof $P$ is then compared to the best previously known
proof $P(d)$ for $\ell(d)$ and $P(d)$ is updated accordingly. For Line 20, recall that we
assumed $\mathfrak{D}(\mathcal{T},\eta)$ to contain no two vertices with the same label, and hence it
contains a unique vertex $v_\eta$ with label $\eta$.
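
The following Python sketch mirrors the structure of Algorithm 1 under simplifying assumptions that are not part of the paper: vertices are identified with their (unique) labels, the measure is passed in as two hypothetical callables `leaf` and `combine`, proofs are represented only by the hyperedge chosen for each vertex, and a vertex that has already been removed from the queue is never updated again. The function name `optimal_proof` is likewise hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Edge = Tuple[FrozenSet[str], str]  # hyperedge (S, d)


def optimal_proof(
    vertices: List[str],
    edges: List[Edge],
    theory: Set[str],
    goal: str,
    leaf: Callable[[str], float],
    combine: Callable[[Edge, List[float]], float],
) -> Optional[Tuple[float, Set[Edge]]]:
    weight: Dict[str, float] = {}            # m(P(v))
    chosen: Dict[str, Optional[Edge]] = {}   # incoming edge of v in P(v)
    support: Dict[str, Set[str]] = {}        # vertices of P(v), for the acyclicity test
    count: Dict[Edge, int] = {e: 0 for e in edges}
    queue: Set[str] = set()
    done: Set[str] = set()

    for v in vertices:                       # Lines 3-9
        tautology = (frozenset(), v)
        if v in theory or tautology in count:
            weight[v] = leaf(v)
            chosen[v] = None if v in theory else tautology
            support[v] = {v}
            queue.add(v)

    while queue:                             # Lines 10-19
        v = min(queue, key=lambda u: weight[u])
        queue.remove(v)
        done.add(v)
        for e in edges:
            sources, d = e
            if v not in sources:
                continue
            count[e] += 1
            if count[e] < len(sources) or d in done:
                continue                     # not all premises reached, or d already final
            if any(d in support[s] for s in sources):
                continue                     # joining the subproofs would create a cycle
            w = combine(e, [weight[s] for s in sources])
            if d not in weight or weight[d] > w:
                weight[d], chosen[d] = w, e
                support[d] = {d}.union(*(support[s] for s in sources))
                queue.add(d)

    if goal not in weight:
        return None
    proof: Set[Edge] = set()                 # collect the edges of P(goal)
    stack = [goal]
    while stack:
        e = chosen[stack.pop()]
        if e is not None and e not in proof:
            proof.add(e)
            stack.extend(e[0])
    return weight[goal], proof


theory = {"a", "b", "c"}
vertices = ["a", "b", "c", "x", "y", "z", "goal"]
edges = [
    (frozenset({"a", "b"}), "x"),
    (frozenset({"x", "c"}), "goal"),
    (frozenset({"a"}), "y"),
    (frozenset({"y"}), "z"),
    (frozenset({"z"}), "goal"),
]
by_depth = optimal_proof(vertices, edges, theory, "goal",
                         lambda s: 0, lambda e, q: 1 + max(q))
by_tree = optimal_proof(vertices, edges, theory, "goal",
                        lambda s: 1, lambda e, q: 1 + sum(q))
```

In this toy structure the two measures prefer different proofs: depth selects the route through `x`, while tree size selects the longer but narrower route through `y` and `z`.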


Lemma 11. For any monotone recursive $\Psi$-measure $\mathfrak{m}$ and deriver $\mathfrak{D}$, Algorithm 1
computes an optimal proof in time polynomial in the size of $\mathfrak{D}(\mathcal{T},\eta)$.


Since we can actually compute an optimal proof in polynomial time in the 
size of the whole derivation structure, it is irrelevant how the upper bound q in 
the decision problem OP is encoded, and hence the following results follow. 


Theorem 12. For any monotone recursive $\Psi$-measure $\mathfrak{m}$ and polynomial deriver $\mathfrak{D}$,
$\mathsf{OP}_{\mathsf{binary}}(\mathfrak{D},\mathfrak{m})$ is in P. It is in EXPTIME for all exponential derivers $\mathfrak{D}$.


4.2 Proof Depth 


We now consider the measure $\mathfrak{m}_{\mathsf{depth}}$ in more detail. We can show lower bounds
of P and EXPTIME for polynomial and exponential derivers, respectively, although
the latter only holds for upper bounds $q$ encoded in binary.

Since our definition of $\mathsf{OP}(\mathfrak{D},\mathfrak{m})$ requires that the input entailment $\mathcal{T} \models \eta$
already holds, we cannot use a straightforward reduction from the entailment
problem in $\mathcal{EL}$ or $\mathcal{ELI}$, however. Instead, we show that ordinary proofs $P$ for
$\mathcal{T} \models \eta$ satisfy $\mathfrak{m}(P) \le q$ for some $q$, and then extend the TBox to $\mathcal{T}'$ in order to
create an artificial proof $P'$ with $\mathfrak{m}(P') > q$. In this way, we ensure that $\mathcal{T}' \models \eta$
holds and can use $q$ to distinguish the artificial from the original proofs.

For $\mathcal{ELI}$, we can use an observation from [9, Example 6.29] for this purpose.


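Proposition 13 ([9]). For every $q \in \mathbb{Q}_{\ge 0}$ and $\mathcal{ELI}$ sentence of the form
$A \sqsubseteq B$, where $A, B \in N_C$, one can construct in time polynomial in $q$ an $\mathcal{ELI}$
theory $\mathcal{T}$ such that $\mathcal{T} \models A \sqsubseteq B$, and every proof for $\mathcal{T} \models A \sqsubseteq B$ in ELI is of depth
larger than $2^q$.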


We can now reduce the entailment problems for $\mathcal{EL}$ and $\mathcal{ELI}$ to obtain the
claimed lower bounds.


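Theorem 14. The problems $\mathsf{OP}_{\mathsf{unary}}(\mathrm{ELK}, \mathfrak{m}_{\mathsf{depth}})$ and $\mathsf{OP}_{\mathsf{binary}}(\mathrm{ELI}, \mathfrak{m}_{\mathsf{depth}})$ are
P-hard and EXPTIME-hard, respectively.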


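Proof. For the P-hardness, we provide a LOGSPACE-reduction from the entailment
problem of a GCI $A \sqsubseteq B$ with two concept names $A, B$ from an $\mathcal{EL}$-theory $\mathcal{T}$,
which is P-hard [9]. To reduce this problem to $\mathsf{OP}_{\mathsf{unary}}(\mathrm{ELK}, \mathfrak{m}_{\mathsf{depth}})$, we need to
find a theory $\mathcal{T}'$ and a number $q$ such that $\mathcal{T}' \models A \sqsubseteq B$ holds, and moreover
$\mathcal{T} \models A \sqsubseteq B$ holds iff $\mathrm{ELK}(\mathcal{T}', A \sqsubseteq B)$ contains a proof of $\mathcal{T}' \models A \sqsubseteq B$ of depth
$\le q$ (cf. Lemma 7).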

First, observe that, since proofs must be acyclic, the depth of any proof
of $A \sqsubseteq B$ from $\mathcal{T}$ is bounded by $q := |\mathrm{ELK}(\mathcal{T}, A \sqsubseteq B)|$, whose size in unary
encoding is polynomial in the size of $\mathcal{T}$. We now construct


$$\mathcal{T}' := \mathcal{T} \cup \{A \sqsubseteq A_1,\ A_1 \sqsubseteq A_2,\ \dots,\ A_q \sqsubseteq B\},$$

where $A_1,\dots,A_q$ are concept names not occurring in $\mathcal{T}$. Clearly, we have
$\mathcal{T}' \models A \sqsubseteq B$. Furthermore, the existence of an admissible proof for $\mathcal{T}' \models A \sqsubseteq B$
of depth at most $q$ is equivalent to $\mathcal{T} \models A \sqsubseteq B$, since any proof that uses the
new concept names must take $q + 1$ consecutive steps using rule $\mathsf{R}_{\sqsubseteq}$, i.e. must
be of depth $q + 1$. Moreover, we can compute $q$ (in binary representation) and
output it in unary representation using a logarithmically space-bounded Turing
machine, and similarly for $\mathcal{T}'$. Hence, the above construction constitutes the
desired LOGSPACE-reduction.
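
The padding step of this reduction is easy to make concrete. The sketch below is only an illustration under stated assumptions: concept inclusions between concept names are encoded as pairs, the chain is assumed to use exactly $q$ fresh names (so that it consists of $q + 1$ inclusions), and the function name `add_long_chain` and the prefix `Fresh` are hypothetical.

```python
from typing import List, Set, Tuple

Axiom = Tuple[str, str]  # (C, D) stands for the concept inclusion C ⊑ D


def add_long_chain(theory: Set[Axiom], a: str, b: str, q: int,
                   prefix: str = "Fresh") -> Set[Axiom]:
    """Extend the theory by a chain a ⊑ F1 ⊑ ... ⊑ Fq ⊑ b over q fresh names."""
    fresh = [f"{prefix}{i}" for i in range(1, q + 1)]
    used = {name for axiom in theory for name in axiom}
    assert used.isdisjoint(fresh), "fresh names must not occur in the theory"
    chain: List[str] = [a] + fresh + [b]
    return set(theory) | set(zip(chain, chain[1:]))


extended = add_long_chain({("A", "C")}, "A", "B", q=3)
```

After the extension the entailment of `A ⊑ B` holds trivially, but only via the long detour through the fresh names, which is what lets the threshold $q$ separate original proofs from artificial ones.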

For the remaining result, we can use similar arguments about the exponential
deriver ELI, where entailment is EXPTIME-hard [9]:


- the minimal depth of a proof in an exponential derivation structure is at most
exponential, and this exponential bound $q$ can be computed in polynomial
time using binary encoding;

- by Proposition 13, there is an $\mathcal{ELI}$ theory $\mathcal{T}$ of size polynomial in the size of
the binary encoding of $q$ such that $\mathcal{T} \models A \sqsubseteq B$ and any proof for $\mathcal{T} \models A \sqsubseteq B$
must have at least depth $q + 1$.


To demonstrate that the generic upper bounds from Theorem 12 are tight
even for unary encoding, we quickly consider the artificial measure $\mathfrak{m}_{\log(\mathsf{depth})}$
(logarithmic depth), which simply computes the (binary) logarithm of the depth of
a given proof. This is also a monotone recursive $\Psi$-measure, since the logarithmic
depth contains exactly the same information as the depth itself. It is easy to
obtain the following lower bounds from the previous results about $\mathfrak{m}_{\mathsf{depth}}$.


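Corollary 15. $\mathsf{OP}_{\mathsf{unary}}(\mathrm{ELK}, \mathfrak{m}_{\log(\mathsf{depth})})$ is P-hard and $\mathsf{OP}_{\mathsf{unary}}(\mathrm{ELI}, \mathfrak{m}_{\log(\mathsf{depth})})$
is EXPTIME-hard.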


Proof. For any deriver $\mathfrak{D}$, $\mathsf{OP}_{\mathsf{binary}}(\mathfrak{D}, \mathfrak{m}_{\mathsf{depth}})$ can be LOGSPACE-reduced to
$\mathsf{OP}_{\mathsf{unary}}(\mathfrak{D}, \mathfrak{m}_{\log(\mathsf{depth})})$, because in order to find a proof of depth at most $q$ (with
$q$ given in binary), one can equivalently look for a proof whose logarithmic depth
is bounded by the value $\log q$. The unary encoding of $\log q$ has the same size as
the binary encoding of $q$ and can be computed in LOGSPACE by flipping all bits
of the binary encoding of $q$ to 1.


We now return to $\mathfrak{m}_{\mathsf{depth}}$ and cover the remaining case of exponential derivers
and unary encoding of the upper bound $q$.


Theorem 16. $\mathsf{OP}_{\mathsf{unary}}(\mathfrak{D}, \mathfrak{m}_{\mathsf{depth}})$ is in PSPACE for any exponential deriver $\mathfrak{D}$.
It is PSPACE-hard for the exponential deriver $\mathfrak{D} = \mathrm{ELI}$.


Proof. For the upper bound, we employ a depth-first guessing strategy: we guess
a proof of depth at most $q$, where at each time point we only keep one branch of
the proof in memory. As the length of this branch is bounded by $q$, and due to
our assumptions on derivers, this procedure only requires polynomial space.
For the lower bound, we provide a reduction from the PSPACE-complete QBF
problem (satisfiability of quantified Boolean formulas). Let $Q_1x_1Q_2x_2\dots Q_mx_m.\phi$
be a quantified Boolean formula, where for $i \in \{1,\dots,m\}$, $Q_i \in \{\exists,\forall\}$, and $\phi$ is
a formula over $\{x_1,\dots,x_m\}$. We assume $\phi$ to be in negation normal form, that is,
negation only occurs directly in front of a variable. We construct an $\mathcal{ELI}$ theory
$\mathcal{T}$ and a number $q$, both of size polynomial in the size of the formula, such that
$\mathcal{T} \models A \sqsubseteq B$ holds (cf. Definition 6) and $\mathcal{T}$ has a proof for $A \sqsubseteq B$ of depth $q$
iff the QBF formula is valid. We use two roles $r_1, r_2$ to deal with the variable
valuations, concept names $A_0, \dots, A_m$ to count the quantifier nesting, and a
concept name $A_\psi$ for every sub-formula $\psi$ of $\phi$. In addition, we use the concept
names $A$ and $B$ occurring in the conclusion, and two concept names $B_1$ and $B_2$.
The concept name $A$ initializes the formula at quantifier nesting level 0:


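$$A \sqsubseteq A_0$$
For every $i \in \{1,\dots,m\}$, $\mathcal{T}$ contains the following sentences to select a truth
valuation for $x_i$, increasing the nesting depth in each step.
$$A_{i-1} \sqsubseteq \exists r_1.(A_i \sqcap A_{x_i}) \qquad (1)$$
$$A_{i-1} \sqsubseteq \exists r_2.(A_i \sqcap A_{\neg x_i}) \qquad (2)$$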


To ensure truth valuations are kept along the role-successors, we use the following
sentences for every $l \in \{x_i, \neg x_i \mid 1 \le i \le m\}$:


$$A_l \sqsubseteq \forall r_1.A_l \qquad A_l \sqsubseteq \forall r_2.A_l \qquad (3)$$


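The following GCIs are now used to evaluate $\phi$. For every conjunction $\psi = \psi_1 \wedge \psi_2$
occurring in $\phi$, we use:


$$A_{\psi_1} \sqcap A_{\psi_2} \sqsubseteq A_\psi \qquad (4)$$
and for every disjunction $\psi = \psi_1 \vee \psi_2$, we use:


$$A_{\psi_1} \sqsubseteq A_\psi \qquad A_{\psi_2} \sqsubseteq A_\psi \qquad (5)$$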


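Finally, the following GCIs are used to propagate the result of the evaluation
back towards the start.


$$A_\phi \sqsubseteq B \qquad (6)$$
$$A_i \sqcap B \sqsubseteq \forall r_1^-.B_1 \qquad A_i \sqcap B \sqsubseteq \forall r_2^-.B_2 \qquad B_1 \sqcap B_2 \sqsubseteq B \qquad \text{if } Q_i = \forall \qquad (8)$$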


One can now show that there exists a proof for $A \sqsubseteq B$ from $\mathcal{T}$ of depth at most $q$
iff the QBF formula is valid, where $q$ is polynomial and determined by the size and
structure of $\phi$. Finally, we can extend $\mathcal{T}$ with the sentences from Proposition 13
to ensure that $\mathcal{T} \models A \sqsubseteq B$ holds while retaining this equivalence.


4.3 The Tree Size Measure 


The tree size measure was discussed already in [2], where tight bounds were 
provided for polynomial derivers and exponential derivers with unary encoding. 
For the case of exponential derivers with binary encoding, only an EXPTIME 
upper bound was provided, and the precise complexity left open. We improve 
this result by showing that $\mathsf{OP}_{\mathsf{binary}}(\mathfrak{D}, \mathfrak{m}_{\mathsf{tree}})$ can indeed be decided in PSPACE.

Fig. 5. Illustration of the argument used for Theorem 17. On the top, the partially
guessed proof tree for two consecutive steps of the algorithm is shown, where the
dark nodes are what is currently kept in memory. On the bottom, we see how the
corresponding tuples are organized into a tree satisfying Conditions S1-S6.


Theorem 17. For any exponential deriver $\mathfrak{D}$, $\mathsf{OP}_{\mathsf{binary}}(\mathfrak{D}, \mathfrak{m}_{\mathsf{tree}})$ is in PSPACE.


Proof (sketch). We describe a non-deterministic procedure for $\mathsf{OP}_{\mathsf{binary}}(\mathfrak{D}, \mathfrak{m}_{\mathsf{tree}})$
in polynomial space. Let $\mathcal{T}$ be a theory, $\eta$ the goal sentence, and $q$ a rational
number in binary encoding. By Lemma 7, it suffices to find a proof $P$ for $\mathcal{T} \models \eta$
in $\mathfrak{D}(\mathcal{T},\eta)$ with $\mathfrak{m}_{\mathsf{tree}}(P) \le q$. The procedure guesses such a proof starting from
the conclusion, while keeping in memory a set $S$ of tuples $(\eta', q')$, where $\eta'$ is a
sentence and $q' \le q$ a rational number. Intuitively, such a tuple states: “We still
need to guess a proof for $\eta'$ of tree size at most $q'$.”


1. Initialize $S := \{(\eta, q)\}$.
2. While $S \neq \emptyset$,

(a) select from $S$ a tuple $(\eta', q')$ such that for all tuples $(\eta'', q'') \in S$ it holds
that $q'' \ge q'$;

(b) guess a hyperedge $(\{v_1,\dots,v_m\}, v')$ in $\mathfrak{D}(\mathcal{T},\eta)$ (using the oracle access
described in Section 2.2) and $m$ numbers $q_1, \dots, q_m$ such that $\ell(v') = \eta'$
and $q_1 + \dots + q_m + 1 \le q'$; and

(c) replace $(\eta', q')$ in $S$ by the tuples $(\ell(v_1), q_1), \dots, (\ell(v_m), q_m)$.


There is a proof for $\mathcal{T} \models \eta$ of tree size at most $q$ iff every step in the algorithm
is successful. To show that it only requires polynomial space, we show that during
the computation, the number of elements in $S$ is always polynomially bounded.
For this, we show that the elements in $S$ can always be organized into a tree
with the following properties:


S1 the root is labeled with $\epsilon$,

S2 every other node is labeled with a distinct element from S, 

S3 every node that is not the root or a leaf has at least 2 children, 

S4 every node has at most $p$ children, where $p$ is the maximal number of premises
in any inference in $\mathfrak{D}(\mathcal{T},\eta)$, which we assumed to be polynomial in the input,

S5 every node $(\eta', q')$ has at most 1 child $(\eta'', q'')$ that is not a leaf and for this
child it holds that $q'' \le \frac{q'}{2}$,

S6 for every node labeled $(\eta', q')$ with children labeled $(\eta_1, q_1), \dots, (\eta_m, q_m)$, we
have $q_1 + \dots + q_m < q'$.


We prove this by induction on the steps of the algorithm, where in each step, we
either replace one tuple in the tree, or put the new tuples under the leaf with
the currently smallest value (see Fig. 5). By S3 and because every number in $S$ is
bounded by $q$, we can show that the tree has depth at most $\log_2 q$, which with S4
and S5 implies that it has at most $p \cdot \log_2 q$ nodes. S2 then implies that
$|S| \le p \cdot \log_2 q$ is always satisfied, and thus that $S$ is polynomially bounded.
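
To see why always expanding the tuple with the smallest bound keeps $S$ small, one can replay the procedure on a known proof. The sketch below is a deterministic illustration, not the non-deterministic algorithm itself: the guesses in step (b) are read off a given proof, each premise receives its exact tree size as its bound, and a tuple for a leaf is assumed to be discharged without further guessing. The names `tree_size` and `max_pending` are hypothetical.

```python
from typing import Dict, Set, Tuple

# conclusion -> premises; sentences that do not occur as keys are leaves
Proof = Dict[str, Tuple[str, ...]]


def tree_size(proof: Proof, sentence: str) -> int:
    return 1 + sum(tree_size(proof, p) for p in proof.get(sentence, ()))


def max_pending(proof: Proof, goal: str, budget: int) -> int:
    """Replay steps 1 and 2 on a known proof and return the largest |S| observed."""
    pending: Set[Tuple[str, int]] = {(goal, budget)}
    largest = 1
    while pending:
        # step (a): pick a tuple with the smallest bound (ties broken by name)
        sentence, q = min(pending, key=lambda t: (t[1], t[0]))
        pending.remove((sentence, q))
        # step (b): use the edge of the known proof and exact tree sizes as guesses
        premises = proof.get(sentence, ())
        budgets = [tree_size(proof, p) for p in premises]
        if sum(budgets) + 1 > q:
            raise ValueError("budget too small for this proof")
        # step (c): replace the tuple by one tuple per premise
        pending.update(zip(premises, budgets))
        largest = max(largest, len(pending))
    return largest


caterpillar = {"g": ("x1", "l1"), "x1": ("x2", "l2"), "x2": ("l3", "l4")}
balanced = {"g": ("a", "b"), "a": ("l1", "l2"), "b": ("l3", "l4")}
```

On the caterpillar-shaped proof the set never holds more than two tuples, because the cheap leaf is always discharged before the expensive subproof is expanded.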


A corresponding lower bound can be found for the exponential deriver ELI 
by a reduction of the word problem for deterministic Turing machines with 
polynomial space bound. 


Theorem 18. For the exponential deriver ELI, $\mathsf{OP}_{\mathsf{binary}}(\mathrm{ELI}, \mathfrak{m}_{\mathsf{tree}})$ is PSPACE-hard.


Proof (sketch). Let $T = (Q, \Gamma, \#, \Sigma, \delta, q_0, F)$ be a deterministic Turing machine,
where $Q$ is the set of states, $\Gamma$ the tape alphabet, $\# \in \Gamma$ the blank symbol, $\Sigma \subseteq \Gamma$
the input alphabet, $\delta\colon Q \times \Gamma \to Q \times \Gamma \times \{-1, 0, +1\}$ the partial transition
function, $q_0$ the initial state, and $F \subseteq Q$ the accepting states. We assume that $T$
is polynomially space bounded, that is, there is a polynomial $p$ such that on input
words $w \in \Sigma^*$, $T$ only accesses the first $p(|w|)$ cells of the tape. For a word $w$,
we denote by $w[i]$ its $i$th letter. For some fixed word $w$, we construct a theory $\mathcal{T}$
using the following names, where $k = p(|w|)$:


- Start marks the initial and Accept an accepting configuration;

- to denote that we are in state $q \in Q$, we use a concept name $S_q$;

- for every $a \in \Gamma$ and $i \in \{0,\dots,k\}$, we use a concept name $A^a_i$ denoting that
the letter $a$ is on tape position $i$;

- for every $i \in \{0,\dots,k\}$, we use the concept name $P^+_i$ to denote that the
head is currently on position $i$, and $P^-_i$ to denote that it is not;

- the role $r$ is used to express the transitions between the configurations.


For convenience, we present the theory not in the required normal form, but 
aggregate conjunctions on the right. The following sentence describes the initial 
configuration. 


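$$\mathit{Start} \sqsubseteq S_{q_0} \sqcap \bigsqcap_{i=0}^{|w|-1} A^{w[i]}_i \sqcap \bigsqcap_{i=|w|}^{k} A^{\#}_i \sqcap P^+_0 \sqcap \bigsqcap_{i=1}^{k} P^-_i \qquad (9)$$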


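The transition from one configuration to the next is encoded with the following
sentences for every $i \in \{0,\dots,k\}$ and every $(q,a) \in Q \times \Gamma$ with $\delta(q,a) = (q', b, d)$:


$$S_q \sqcap A^a_i \sqcap P^+_i \sqsubseteq \exists r.S_{q'} \sqcap \forall r.A^b_i \sqcap \forall r.P^+_{i+d} \sqcap \bigsqcap_{j \in \{0,\dots,k\} \setminus \{i+d\}} \forall r.P^-_j \qquad (10)$$


$$A^a_i \sqcap P^-_i \sqsubseteq \forall r.A^a_i \qquad (11)$$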



Finally, we use the following sentences to detect accepting configurations and
propagate the information of acceptance back to the initial configuration:


$$S_f \sqsubseteq \mathit{Accept} \quad \text{for all } f \in F, \qquad (12)$$
$$\mathit{Accept} \sqsubseteq \forall r^-.\mathit{Accept} \qquad (13)$$


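One can find a number $q$ exponential in $k$ and the size of $T$ s.t. there is
a proof for $\mathcal{T} \models \mathit{Start} \sqsubseteq \mathit{Accept}$ with tree size at most $q$ iff $T$ accepts $w$. Using
Proposition 13, we can extend $\mathcal{T}$ to a theory $\mathcal{T}'$ s.t. $\mathcal{T}' \models \mathit{Start} \sqsubseteq \mathit{Accept}$, while
a proof of tree size $q$ exists iff $T$ accepts $w$ (observe that $\mathfrak{m}_{\mathsf{tree}}(P) \ge \mathfrak{m}_{\mathsf{depth}}(P)$
holds for all proofs $P$).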


5 Conclusion 


We have investigated the complexity of finding optimal proofs w.r.t. quality 
measures that satisfy the property of being monotone recursive. Two important 
examples of this class of measures, depth and tree size, have been considered in 
detail in combination with exponential and polynomial derivers. The obtained 
results are promising: given a deriver, the search for an optimal proof for an 
entailment can be easier than producing all of the proofs by this deriver. The 
algorithms used to show the upper bounds can serve as building blocks for finding
an optimal proof w.r.t. a monotone recursive measure automatically.

We conjecture that weighted versions of tree size and depth, where sentences or 
inference steps can have associated rational weights, are also monotone recursive, 
and the generic upper bounds established in this paper can be straightforwardly 
applied to them. However, a more thorough study is required here, since the 
complexity of the decision problem depends on the exact way in which the 
weights are employed. This step towards weighted measures is motivated by user 
studies [1, 15, 24], demonstrating that different types of sentences and logical
inferences can be more or less difficult to understand. 

Abstract. The application of automated reasoning approaches to Description
Logic (DL) ontologies may produce certain consequences that
either are deemed to be wrong or should be hidden for privacy reasons. 
The question is then how to repair the ontology such that the unwanted 
consequences can no longer be deduced. An optimal repair is one where 
the least amount of other consequences is removed. Most of the previ- 
ous approaches to ontology repair are of a syntactic nature in that they 
remove or weaken the axioms explicitly present in the ontology, and 
thus cannot achieve semantic optimality. In previous work, we have ad- 
dressed the problem of computing optimal repairs of (quantified) ABoxes, 
where the unwanted consequences are described by concept assertions of 
the lightweight DL EL. In the present paper, we improve on the results 
achieved so far in two ways. First, we allow for the presence of terminological
knowledge in the form of an $\mathcal{EL}$ TBox. This TBox is assumed to
be static in the sense that it cannot be changed in the repair process. Sec- 
ond, the construction of optimal repairs described in our previous work 
is best case exponential. We introduce an optimized construction that 
is exponential only in the worst case. First experimental results indicate 
that this reduces the size of the computed optimal repairs considerably. 


1 Introduction 


Description Logics [3] are a well-investigated family of logic-based knowledge 
representation languages, which are frequently used to formalize ontologies for 
application domains such as biology and medicine [17]. As the size of ontolo- 
gies grows, the likelihood of them containing errors increases as well. This is 
particularly problematic if the data, stored in the ABox, are automatically ex- 
tracted from text or other sources using natural language processing or machine 
learning. The reasoning services of DL systems [22,12,33,15], which derive im- 
plicit consequences from the explicitly represented knowledge, are not only useful 
once an ontology is deployed, but can also be employed for debugging purposes 
by exhibiting consequences that are not supposed to hold in the application 
domain. Another reason why one might want to remove a consequence is that
it reveals private information that is supposed to be hidden [14,5]. Once such 
an unwanted consequence is detected, it is often not easy to see how to repair 
the ontology in order to get rid of this consequence. Classical repair approaches 
based on axiom pinpointing [31,29,27,32,21,8] compute maximal subsets of the 
ontology that do not have the consequence. The obtained result thus strongly 
depends on the syntactic form of the axioms. For example, it is well-known that, 
for expressive DLs, a finite set of terminological axioms can be expressed by a 
single axiom. If the given terminology (TBox) is of this shape, then the only 
possible classical repair is the empty TBox. To alleviate this problem, repair 
approaches have been developed that replace certain axioms by weaker ones (in 
the sense that they have fewer consequences) instead of removing them completely
[18,24,34,6]. However, these approaches usually do not produce optimal repairs. 
In fact, it was shown in [6] that, even for the inexpressive DL EL, optimal repairs 
need not exist. The abstract example given there can be rephrased as follows. 
Assume that the TBox defines humans to be exactly those individuals that have 
a human parent, and that the ABox says that Sam is a human. After we find 
out that Sam is in fact not human [9], we want to get rid of the latter assertion, 
but keep the (correct) consequences saying that Sam has an unbounded chain 
of ancestors (of undetermined species). If the TBox is assumed to be fixed, then 
there is no optimal repair of the ABox since we can add only a finite number of 
parent assertions. 


To avoid such problems, our previous work on computing optimal repairs (for- 
mulated in the guise of achieving compliance with privacy policies) restricted the 
attention to the case without TBox. In [5] the ABox was additionally restricted 
to be a so-called instance store [19], i.e., an ABox without role assertions. The 
privacy policy (specifying which consequences are to be removed) was given as 
EL instance queries. In this setting, optimal repairs always exist and can be com- 
puted in exponential time, which is optimal since there may be exponentially 
many optimal repairs of exponential size. 


In [7] these results were extended to ABoxes with role assertions. More
precisely, we considered quantified ABoxes in which some individuals are
anonymized by viewing them as existentially quantified variables. For example,
assume that the ABox contains the information that Ben has a parent, Jerry, that
is both rich and famous, and we want to remove the consequence
$\exists\mathit{parent}.(\mathit{Rich} \sqcap \mathit{Famous})(\mathit{BEN})$. Classical repairs can be obtained by removing one
of the assertions Rich(JERRY), Famous(JERRY), and parent(BEN, JERRY). If instead
we replace the first assertion with Rich(x) and parent(BEN, x) for an existentially
quantified variable x, then we retain more consequences. Note that we could
not have used an individual name (i.e., constant) ANNE instead of x since information
like Rich(ANNE) about Anne does not follow from the original ABox.
We show in [7] that in this setting all optimal repairs can be computed by an
exponential-time algorithm with access to an NP-oracle. The oracle is needed
since our algorithm first computes a superset of the set of optimal repairs, from
which non-optimal ones need to be removed using the (NP-complete) entailment
test between (potentially exponentially large) quantified ABoxes. We also
consider a modified version of entailment (called IQ-entailment) in [7], where
quantified ABoxes are compared w.r.t. which $\mathcal{EL}$ instance relationships they
imply. Using this notion, no NP-oracle is needed for computing the set of all
IQ-optimal repairs since IQ-entailment can be decided in polynomial time.

In the present paper, we improve on these results in two respects. On the one
hand, we allow for the presence of terminological knowledge in the form of an
$\mathcal{EL}$ TBox, which is assumed to be correct, and thus is not changed by the repair.
To deal with a TBox, the approach from [7] for computing optimal repairs must
be extended in two ways. First, the ABox needs to be saturated w.r.t. the TBox
before applying our repair approach. The saturated ABox has the same consequences
as the original one has together with the TBox. In our Ben and Jerry
example, assume that the assertion Rich(JERRY) does not belong to the original
ABox, but the TBox contains the axiom $\mathit{Famous} \sqsubseteq \mathit{Rich}$. Then the ABox on its
own does not have the unwanted consequence $\exists\mathit{parent}.(\mathit{Rich} \sqcap \mathit{Famous})(\mathit{BEN})$,
but together with the TBox it does. Saturation adds the assertion Rich(JERRY)
to the ABox. For arbitrary TBoxes, saturation need not terminate. We consider
two ways to remedy this problem: either allow for arbitrary TBoxes, but consider
IQ-entailment, or use classical entailment, but consider cycle-restricted
TBoxes [1]. In both cases, saturation always terminates; in the former in polynomial
and in the latter in exponential time. One might be tempted to assume
that, after saturation, one can simply apply the repair approach of [7] unchanged.
This is not true, however, since the TBox may re-add assertions that have been
removed or replaced by the repair. In our example, where Rich(JERRY) is replaced,
but Famous(JERRY) is left untouched in the repair, the repaired ABox
together with the TBox would still have the unwanted consequence. Thus, the
repair approach needs to be changed to take this possibility into account.
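
The interplay between saturation and repair in the Ben and Jerry example can be replayed in a few lines. The sketch below is deliberately restricted and is not the saturation procedure of the paper: it only handles concept assertions on named objects and TBox axioms between concept names, so termination is trivial and existential restrictions are ignored. The function name `saturate` is hypothetical.

```python
from typing import Set, Tuple

Assertion = Tuple[str, str]   # (A, u) stands for the concept assertion A(u)
Inclusion = Tuple[str, str]   # (A, B) stands for the concept inclusion A ⊑ B


def saturate(assertions: Set[Assertion], tbox: Set[Inclusion]) -> Set[Assertion]:
    """Apply atomic concept inclusions exhaustively to the concept assertions."""
    result = set(assertions)
    changed = True
    while changed:
        changed = False
        for sub, sup in tbox:
            for concept, obj in list(result):
                if concept == sub and (sup, obj) not in result:
                    result.add((sup, obj))
                    changed = True
    return result


tbox = {("Famous", "Rich")}
abox = {("Famous", "JERRY")}
saturated = saturate(abox, tbox)

# Removing Rich(JERRY) alone is not enough: the TBox derives it again.
naive_repair = saturated - {("Rich", "JERRY")}
after_tbox = saturate(naive_repair, tbox)
```

Here `after_tbox` again contains `("Rich", "JERRY")`, which illustrates why a repair must also account for what the static TBox can re-derive.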

On the other hand, the construction of optimal repairs described in our pre- 
vious work [5,7], and extended in this paper such that it can deal with TBoxes, 
is best case exponential. The second contribution of this paper is the design of a 
new construction, both for classical and IQ-entailment, that is exponential only
in the worst case. We also report on first experimental results, which indicate 
that this reduces the size of the computed optimal repairs considerably. 

Detailed proofs of our results can be found in [4]. 


2 Preliminaries 


Throughout this paper, we assume that $\Sigma$ is a signature, which is a disjoint
union of sets $\Sigma_O$, $\Sigma_C$, and $\Sigma_R$ of object names, concept names, and role names.
We use symbols $t, u, v, w$ to denote object names, $A, B$ to denote concept names,
and $r, s$ to denote role names, all of them possibly with sub- or superscripts.
As in [7], a quantified ABox (qABox) $\exists X.\mathcal{A}$ over $\Sigma$ consists of a finite subset
$X$ of $\Sigma_O$, the elements of which are called variables, and a matrix $\mathcal{A}$, which is
a finite set of concept assertions $A(u)$ where $u \in \Sigma_O$ and $A \in \Sigma_C$, and of role
assertions $r(u,v)$ where $u, v \in \Sigma_O$ and $r \in \Sigma_R$. A non-variable object name in
$\exists X.\mathcal{A}$ is called an individual name, and the set of all these names is denoted as
$\Sigma_I(\exists X.\mathcal{A})$. We further set $\Sigma_O(\exists X.\mathcal{A}) := \Sigma_I(\exists X.\mathcal{A}) \cup X$. Traditional DL ABoxes
are qABoxes where $X = \emptyset$; we then write $\mathcal{A}$ instead of $\exists\emptyset.\mathcal{A}$. The matrix of a
qABox is such a traditional ABox.
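
As a concrete reading of this definition, a qABox can be stored as a set of variables together with its matrix. The following sketch is one possible encoding, not the data structure of any implementation mentioned in the paper; the class name `QABox` and its method names are hypothetical, and individual names are taken to be the non-variable object names occurring in the matrix.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class QABox:
    variables: FrozenSet[str]                          # the set X
    concept_assertions: FrozenSet[Tuple[str, str]]     # (A, u) for A(u)
    role_assertions: FrozenSet[Tuple[str, str, str]]   # (r, u, v) for r(u, v)

    def _names_in_matrix(self) -> Set[str]:
        names = {u for _, u in self.concept_assertions}
        for _, u, v in self.role_assertions:
            names.update((u, v))
        return names

    def individual_names(self) -> Set[str]:
        # non-variable object names occurring in the qABox
        return self._names_in_matrix() - self.variables

    def object_names(self) -> Set[str]:
        # individual names together with the variables
        return self.individual_names() | self.variables


# The repaired ABox from the Ben and Jerry example, with x as a variable.
repaired = QABox(
    variables=frozenset({"x"}),
    concept_assertions=frozenset({("Rich", "x"), ("Famous", "JERRY")}),
    role_assertions=frozenset({("parent", "BEN", "JERRY"), ("parent", "BEN", "x")}),
)
```

A traditional ABox corresponds to an instance with an empty `variables` set.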

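An interpretation $\mathcal{I}$ of $\Sigma$ is a pair $(\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$, where the domain $\Delta^{\mathcal{I}}$ is a non-empty
set and the interpretation function $\cdot^{\mathcal{I}}$ maps each $u \in \Sigma_O$ to an element $u^{\mathcal{I}}$
of $\Delta^{\mathcal{I}}$, each $A \in \Sigma_C$ to a set $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$, and each $r \in \Sigma_R$ to a binary relation $r^{\mathcal{I}}$
over $\Delta^{\mathcal{I}}$. The interpretation $\mathcal{I}$ of $\Sigma$ is a model of a qABox $\exists X.\mathcal{A}$ over $\Sigma$ if there
is an interpretation $\mathcal{J}$ such that $\Delta^{\mathcal{J}} = \Delta^{\mathcal{I}}$, the interpretation functions $\cdot^{\mathcal{I}}$ and
$\cdot^{\mathcal{J}}$ coincide on $\Sigma \setminus X$, and $u^{\mathcal{J}} \in A^{\mathcal{J}}$ for each $A(u) \in \mathcal{A}$ as well as $(u^{\mathcal{J}}, v^{\mathcal{J}}) \in r^{\mathcal{J}}$
for each $r(u,v) \in \mathcal{A}$.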

Following [7], we define $\mathcal{EL}$ atoms and $\mathcal{EL}$ concept descriptions over $\Sigma$ by
simultaneous induction as follows. An $\mathcal{EL}$ atom is either a concept name $A \in \Sigma_C$
or an existential restriction $\exists r.C$ for some role name $r \in \Sigma_R$ and an $\mathcal{EL}$ concept
description $C$. An $\mathcal{EL}$ concept description is a conjunction $\bigsqcap\mathcal{C}$ where $\mathcal{C}$ is a
finite set of $\mathcal{EL}$ atoms. An $\mathcal{EL}$ concept inclusion is of the form $C \sqsubseteq D$ for $\mathcal{EL}$
concept descriptions $C$ and $D$, and an $\mathcal{EL}$ TBox is a finite set of such concept
inclusions. An $\mathcal{EL}$ concept assertion is an expression $C(u)$, where $C$ is an $\mathcal{EL}$
concept description and $u \in \Sigma_O$.

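For each interpretation $\mathcal{I}$ of $\Sigma$, we extend the interpretation function $\cdot^{\mathcal{I}}$ to 
$\mathcal{EL}$ atoms and $\mathcal{EL}$ concept descriptions in the following manner: 


— $(\exists r.C)^{\mathcal{I}} := \{\, \delta \mid \text{there exists some } \gamma \text{ such that } (\delta, \gamma) \in r^{\mathcal{I}} \text{ and } \gamma \in C^{\mathcal{I}} \,\}$, 
— $(\bigsqcap\mathcal{C})^{\mathcal{I}} := \bigcap\{\, C^{\mathcal{I}} \mid C \in \mathcal{C} \,\}$ where $\bigcap\emptyset := \Delta^{\mathcal{I}}$. 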


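The interpretation $\mathcal{I}$ is a model of the concept inclusion $C \sqsubseteq D$ (the concept 
assertion $C(u)$) if $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$ ($u^{\mathcal{I}} \in C^{\mathcal{I}}$), and of the TBox $\mathcal{T}$ if it is a model of 
each concept inclusion in $\mathcal{T}$. 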

To make the syntax introduced above more akin to the one usually employed 
for $\mathcal{EL}$, we denote the empty conjunction $\bigsqcap\emptyset$ as $\top$ (top concept), singleton 
conjunctions $\bigsqcap\{C\}$ as $C$, and conjunctions $\bigsqcap\mathcal{C}$ for $|\mathcal{C}| \geq 2$ as $C_1 \sqcap \dots \sqcap C_n$, 
where $C_1, \dots, C_n$ is an enumeration of the elements of $\mathcal{C}$ in an arbitrary order. 
Since we do not distinguish between the singleton conjunction $\bigsqcap\{C\}$ and 
the atom $C$, each atom is also a concept description. The set $\mathsf{Sub}(C)$ of 
subconcepts of an $\mathcal{EL}$ concept description $C$ is defined as follows: $\mathsf{Sub}(A) := \{A\}$, 
$\mathsf{Sub}(\exists r.C) := \{\exists r.C\} \cup \mathsf{Sub}(C)$, and $\mathsf{Sub}(\bigsqcap\mathcal{C}) := \{\bigsqcap\mathcal{C}\} \cup \bigcup\{\, \mathsf{Sub}(D) \mid D \in \mathcal{C} \,\}$. 
The set $\mathsf{Atoms}(C)$ consists of all atoms contained in $\mathsf{Sub}(C)$. These two notions 
are extended to TBoxes and sets of concept assertions in the obvious way. 
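
The inductive definition of subconcepts and atoms can be sketched in a few lines of Python. This is only an illustration under an assumed encoding that is not part of the paper: a concept description is a frozenset of atoms, an atom is either a concept name (a string) or a pair (role, filler), and the helper names conj, sub, and atoms are hypothetical.

```python
# Hypothetical encoding (not from the paper): a concept description is a
# frozenset of atoms; an atom is a concept name (str) or a pair
# (role, filler) standing for an existential restriction.

def conj(*atoms_):
    """Build the conjunction of the given atoms; conj() is the top concept."""
    return frozenset(atoms_)


def sub(concept):
    """Return Sub(C); atoms are represented as singleton conjunctions."""
    result = {concept}
    for atom in concept:
        result.add(conj(atom))          # the atom itself is a subconcept
        if isinstance(atom, tuple):     # existential restriction: descend
            _role, filler = atom
            result |= sub(filler)
    return result


def atoms(concept):
    """Return Atoms(C): all atoms contained in Sub(C)."""
    return {next(iter(c)) for c in sub(concept) if len(c) == 1}


# C is the conjunction of A and the restriction "exists r.(B and exists s.Top)".
inner = conj("B", ("s", conj()))
C = conj("A", ("r", inner))
print(len(sub(C)), len(atoms(C)))
```

Representing atoms as singleton conjunctions mirrors the convention above that the singleton conjunction and the atom are not distinguished.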

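Let $\alpha, \beta$ be qABoxes, concept inclusions, or concept assertions (possibly not 
both of the same kind), and $\mathcal{T}$ an $\mathcal{EL}$ TBox. Then we write $\mathcal{I} \models \alpha$ if the 
interpretation $\mathcal{I}$ is a model of $\alpha$. We say that $\alpha$ entails $\beta$ w.r.t. $\mathcal{T}$ (written 
$\alpha \models^{\mathcal{T}} \beta$) if every model of $\alpha$ and $\mathcal{T}$ is a model of $\beta$. Furthermore, $\alpha$ and $\beta$ 
are equivalent w.r.t. $\mathcal{T}$ (written $\alpha \equiv^{\mathcal{T}} \beta$) if $\alpha \models^{\mathcal{T}} \beta$ and $\beta \models^{\mathcal{T}} \alpha$. In case 
$\mathcal{T} = \emptyset$, we will sometimes write $\models$ instead of $\models^{\emptyset}$. If $\exists\emptyset.\emptyset \models^{\mathcal{T}} C \sqsubseteq D$, then 
we also write $C \sqsubseteq^{\mathcal{T}} D$ and say that $C$ is subsumed by $D$ w.r.t. $\mathcal{T}$; in case 
$\mathcal{T} = \emptyset$ we simply say that $C$ is subsumed by $D$. Two $\mathcal{EL}$ concept descriptions are 
equivalent w.r.t. $\mathcal{T}$ (written $C \equiv^{\mathcal{T}} D$) if they subsume each other w.r.t. $\mathcal{T}$. We 
write $C \sqsubset^{\mathcal{T}} D$ to indicate that $C \sqsubseteq^{\mathcal{T}} D$, but $C \not\equiv^{\mathcal{T}} D$. If $\exists X.\mathcal{A} \models^{\mathcal{T}} C(a)$, then 
$a$ is called an instance of $C$ w.r.t. $\exists X.\mathcal{A}$ and $\mathcal{T}$. For $\mathcal{EL}$, the subsumption and 
the instance problem are decidable in polynomial time [2]. However, entailment 
between qABoxes is NP-complete even w.r.t. the empty TBox [7]. 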

We also use the reduced form $C^r$ of $\mathcal{EL}$ concept descriptions $C$ [23], which is 
obtained by removing redundant subdescriptions (see [7] for details). Adapting 
the results in [23], one can show that $C \equiv^{\emptyset} C^r$ and that $C \equiv^{\emptyset} D$ implies $C^r = D^r$. 


3 A Tale of Two Entailments 


DL-based ontologies are usually accessed through appropriate query languages, 
where for the purpose of this paper it is sufficient to assume that a query lan- 
guage is given by a fragment of first-order logic. Instead of comparing ontologies 
w.r.t. the models they have, it thus makes sense to compare them w.r.t. the 
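answers to queries they entail [25]. Given such a query language QL and an $\mathcal{EL}$ 
TBox $\mathcal{T}$, we say that the qABox $\exists X.\mathcal{A}$ QL-entails the qABox $\exists Y.\mathcal{B}$ w.r.t. $\mathcal{T}$ 
(written $\exists X.\mathcal{A} \models^{\mathcal{T}}_{\mathsf{QL}} \exists Y.\mathcal{B}$) if for each query $\varphi(x_1, \dots, x_n) \in \mathsf{QL}$ and each 
tuple of individuals $(a_1, \dots, a_n)$ we have that $\mathcal{T} \wedge \exists Y.\mathcal{B} \models \varphi(a_1, \dots, a_n)$ implies 
$\mathcal{T} \wedge \exists X.\mathcal{A} \models \varphi(a_1, \dots, a_n)$, where we view the TBox and the ABox as first-order 
formulae and $\models$ is classical first-order entailment (see [25] for more details). We 
say that two qABoxes are QL-equivalent w.r.t. $\mathcal{T}$ if they QL-entail each other 
w.r.t. $\mathcal{T}$, and denote this equivalence relation as $\equiv^{\mathcal{T}}_{\mathsf{QL}}$. 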

For EL ontologies, one usually considers instance queries (IQ) or conjunc- 
tive queries (CQ). The former are given by EL concept descriptions, viewed as 
first-order formulae with one free variable. The latter are basically qABoxes of 
the form $\exists X.\mathcal{A}$, but with the elements of $\Sigma_I(\exists X.\mathcal{A})$ viewed as free variables. 
Replacing these free variables with a tuple of individuals thus yields a qABox in 
the sense introduced above. In particular, this means that CQ-entailment cor- 
responds to entailment of the same qABoxes (see [7] for more details regarding 
the connection between conjunctive queries and qABoxes). 


3.1 Classical Entailment and CQ-Entailment 


Due to the close connection between conjunctive queries and qABoxes mentioned 
above, it is easy to see that the classical entailment relation $\models^{\mathcal{T}}$ between 
qABoxes, as introduced in the previous section, actually coincides with 
CQ-entailment $\models^{\mathcal{T}}_{\mathsf{CQ}}$. To keep the notation more uniform and to distinguish this 
kind of entailment explicitly from IQ-entailment, we will usually talk about 
CQ-entailment and write $\models^{\mathcal{T}}_{\mathsf{CQ}}$. 


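Whenever we compare two qABoxes $\exists X.\mathcal{A}$ and $\exists Y.\mathcal{B}$, we assume without 
loss of generality that they are renamed apart, which means that $X$ is disjoint 
with $\Sigma_O(\exists Y.\mathcal{B})$ and $Y$ is disjoint with $\Sigma_O(\exists X.\mathcal{A})$, and we further assume that 
the two qABoxes speak about the same set of individual names $\Sigma_I := \Sigma_I(\exists X.\mathcal{A}) \cup 
\Sigma_I(\exists Y.\mathcal{B})$. For the case of an empty TBox, it was shown in [7] that $\exists X.\mathcal{A} \models^{\emptyset}_{\mathsf{CQ}} 
\exists Y.\mathcal{B}$ iff there is a homomorphism from $\exists Y.\mathcal{B}$ to $\exists X.\mathcal{A}$. A homomorphism from 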
$\sqcap$-rule. If $(C_1 \sqcap \dots \sqcap C_n)(t) \in \mathcal{A}$, then remove this assertion from $\mathcal{A}$, and add the 
assertions $C_1(t), \dots, C_n(t)$ to $\mathcal{A}$. 

$\exists$-rule. If $(\exists r.C)(t) \in \mathcal{A}$, then remove this assertion from $\mathcal{A}$, add the two assertions 
$r(t,x)$ and $C(x)$ to $\mathcal{A}$, and add $x$ to $X$, where $x$ is a fresh variable not occurring 
in $\mathcal{A}$ or $X$. 

$\sqsubseteq$-rule. If $t \in \Sigma_O(\exists X.\mathcal{A})$, $C \sqsubseteq D \in \mathcal{T}$, $\mathcal{A} \models C(t)$, and $\mathcal{A} \not\models D(t)$, then add the 
assertion $D(t)$ to $\mathcal{A}$. 


The $\sqcap$-rule has highest priority and the $\sqsubseteq$-rule has lowest priority. 


Fig. 1: The CQ-saturation rules. 


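$\exists Y.\mathcal{B}$ to $\exists X.\mathcal{A}$ is a mapping $h\colon \Sigma_O(\exists Y.\mathcal{B}) \to \Sigma_O(\exists X.\mathcal{A})$ such that $h(a) = a$ 
for each $a \in \Sigma_I$, $A(h(u)) \in \mathcal{A}$ for each $A(u) \in \mathcal{B}$, and $r(h(u),h(v)) \in \mathcal{A}$ for each 
$r(u,v) \in \mathcal{B}$. In order to obtain a similar characterization of entailment for the 
case of a non-empty TBox $\mathcal{T}$, we need to saturate the given qABox w.r.t. $\mathcal{T}$. 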
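
For the empty TBox, the homomorphism characterization can be tried out with a naive search. The following sketch assumes a hypothetical encoding that is not from the paper: a qABox is a triple (variables, concept assertions as (A, u) pairs, role assertions as (r, u, v) triples). It enumerates all candidate mappings and therefore takes exponential time in the number of variables, in line with the NP-completeness noted in Section 2.

```python
from itertools import product

# Hypothetical encoding (not from the paper): a qABox is a triple
# (variables, concept_assertions, role_assertions) with assertions given as
# (A, u) and (r, u, v) tuples over string names.

def objects(qabox):
    """All object names of a qABox: its variables plus names in the matrix."""
    variables, concepts, roles = qabox
    names = set(variables)
    names |= {u for _, u in concepts}
    names |= {o for _, u, v in roles for o in (u, v)}
    return names


def find_homomorphism(source, target):
    """Return a homomorphism from `source` to `target` as a dict, or None."""
    src_vars, src_concepts, src_roles = source
    _, tgt_concepts, tgt_roles = target
    tgt_objects = list(objects(target))
    variables = list(src_vars)
    individuals = objects(source) - set(src_vars)
    # Brute force: try every assignment of target objects to source variables.
    for images in product(tgt_objects, repeat=len(variables)):
        h = {a: a for a in individuals}      # individuals are mapped to themselves
        h.update(zip(variables, images))
        if (all((A, h[u]) in tgt_concepts for A, u in src_concepts)
                and all((r, h[u], h[v]) in tgt_roles for r, u, v in src_roles)):
            return h
    return None


abox = (set(), {("A", "b")}, {("r", "a", "b")})
query = ({"y"}, set(), {("r", "a", "y")})
print(find_homomorphism(query, abox))
```

In this example the ABox CQ-entails the query-like qABox, witnessed by mapping y to b, whereas no homomorphism exists in the opposite direction because A(b) has no counterpart.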

Basically, this saturation performs what is called the chase in the database 
community [26,20,10]. Given an $\mathcal{EL}$ TBox $\mathcal{T}$ and a qABox $\exists X.\mathcal{A}$, it extends the 
ABox by new assertions that are implied by the TBox. The rules that realize 
this are described in Fig. 1. Their role is two-fold: whereas the $\sqsubseteq$-rule adds new 
concept assertions that are implied by the ABox together with the TBox, the 
other two rules break down the complex concept assertions added by this rule 
into smaller parts. 

In general, applying these rules need not terminate; e.g., if applied to the 
qABox $\exists\emptyset.\{A(a)\}$ for the TBox $\{A \sqsubseteq \exists r.A\}$. There are various sufficient 
conditions that guarantee termination of the chase [13]. Here, we use a condition 
introduced in [1] in the context of unification in $\mathcal{EL}$. 


Definition 1. The $\mathcal{EL}$ TBox $\mathcal{T}$ is cycle-restricted if there is no non-empty 
sequence of role names $r_1, \dots, r_k$ and $\mathcal{EL}$ concept description $C$ such that 
$C \sqsubseteq^{\mathcal{T}} \exists r_1.\cdots\exists r_k.C$. 


As shown in [1], it can be decided in polynomial time whether a given $\mathcal{EL}$ TBox 
is cycle-restricted or not. For cycle-restricted TBoxes, CQ-saturation always 
terminates. 


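Theorem 2. Let $\mathcal{T}$ be a cycle-restricted $\mathcal{EL}$ TBox and $\exists X.\mathcal{A}$ a qABox. Then 
exhaustive application of the CQ-saturation rules terminates in exponential time 
in the size of $\exists X.\mathcal{A}$ and $\mathcal{T}$, and yields a qABox $\mathsf{sat}^{\mathcal{T}}_{\mathsf{CQ}}(\exists X.\mathcal{A})$ such that the 
following statements are equivalent for all qABoxes $\exists Y.\mathcal{B}$: 


— $\exists X.\mathcal{A} \models^{\mathcal{T}}_{\mathsf{CQ}} \exists Y.\mathcal{B}$, 
— $\mathsf{sat}^{\mathcal{T}}_{\mathsf{CQ}}(\exists X.\mathcal{A}) \models \exists Y.\mathcal{B}$, 
— there is a homomorphism from $\exists Y.\mathcal{B}$ to $\mathsf{sat}^{\mathcal{T}}_{\mathsf{CQ}}(\exists X.\mathcal{A})$. 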


We can show that there are examples where the CQ-saturation of a qABox w.r.t. 
a cycle-restricted TBox is of exponential size, and thus its computation must take 
exponential time. Nevertheless, the entailment relation $\models^{\mathcal{T}}_{\mathsf{CQ}}$ can still be decided 
within NP by adapting results for conjunctive query answering in $\mathcal{EL}$ [30]. 
$\sqcap$-rule. If $(C_1 \sqcap \dots \sqcap C_n)(t) \in \mathcal{A}$, then remove this assertion from $\mathcal{A}$ and add the 
assertions $C_1(t), \dots, C_n(t)$ to $\mathcal{A}$. 

$\exists$-rule. If $(\exists r.C)(t) \in \mathcal{A}$, then remove this assertion from $\mathcal{A}$, add the two assertions 
$r(t, x_C)$ and $C(x_C)$ to $\mathcal{A}$, and add $x_C$ to $X$ if it is not already there. 

$\sqsubseteq$-rule. If $t \in \Sigma_O(\exists X.\mathcal{A})$, $C \sqsubseteq D \in \mathcal{T}$, $\mathcal{A} \models C(t)$, and $\mathcal{A} \not\models D(t)$, then add the 
assertion $D(t)$ to $\mathcal{A}$. 


The $\sqcap$-rule has higher precedence than the $\exists$-rule, and the latter has higher precedence 
than the $\sqsubseteq$-rule. 


Fig. 2: The IQ-saturation rules. 


3.2 IQ-Entailment 


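Recall that the qABox $\exists X.\mathcal{A}$ IQ-entails the qABox $\exists Y.\mathcal{B}$ w.r.t. the $\mathcal{EL}$ TBox $\mathcal{T}$ if 
every concept assertion $C(a)$ entailed w.r.t. $\mathcal{T}$ by the latter is also entailed w.r.t. 
$\mathcal{T}$ by the former. In the following we assume again that these two qABoxes 
are renamed apart. For the case of an empty TBox, it was shown in [7] that 
$\exists X.\mathcal{A} \models^{\emptyset}_{\mathsf{IQ}} \exists Y.\mathcal{B}$ iff there is a simulation from $\exists Y.\mathcal{B}$ to $\exists X.\mathcal{A}$. A simulation from 
$\exists Y.\mathcal{B}$ to $\exists X.\mathcal{A}$ is a relation $\mathfrak{S} \subseteq \Sigma_O(\exists Y.\mathcal{B}) \times \Sigma_O(\exists X.\mathcal{A})$ such that $(a,a) \in \mathfrak{S}$ 
for each $a \in \Sigma_I$ and, for each $(u,v) \in \mathfrak{S}$, $A(u) \in \mathcal{B}$ implies $A(v) \in \mathcal{A}$ and 
$r(u,u') \in \mathcal{B}$ implies that there exists an object $v' \in \Sigma_I \cup X$ such that $(u',v') \in \mathfrak{S}$ 
and $r(v,v') \in \mathcal{A}$. Since checking the existence of a simulation can be done in 
polynomial time [16], we conclude that IQ-entailment between qABoxes can be 
decided in polynomial time for the case of an empty TBox. 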
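
One simple way to check whether a simulation exists is to compute the greatest relation satisfying the two local conditions and then test whether it contains all pairs (a, a). The sketch below does this by naive fixpoint refinement; it is not the algorithm of [16], and the qABox encoding as a triple (variables, concept assertions, role assertions) as well as all function names are assumptions made for illustration.

```python
# Hypothetical encoding (not from the paper): a qABox is a triple
# (variables, concept_assertions, role_assertions) with assertions given as
# (A, u) and (r, u, v) tuples.

def objects(qabox):
    variables, concepts, roles = qabox
    names = set(variables) | {u for _, u in concepts}
    return names | {o for _, u, v in roles for o in (u, v)}


def individuals(qabox):
    return objects(qabox) - set(qabox[0])


def greatest_simulation(source, target):
    """Largest relation satisfying the concept and role conditions."""
    _, src_concepts, src_roles = source
    _, tgt_concepts, tgt_roles = target
    shared = individuals(source) | individuals(target)
    src_objs = objects(source) | shared
    tgt_objs = objects(target) | shared
    # Start with all pairs that respect the concept assertions ...
    sim = {(u, v) for u in src_objs for v in tgt_objs
           if all((A, v) in tgt_concepts for A, w in src_concepts if w == u)}
    # ... and remove pairs violating the role condition until nothing changes.
    changed = True
    while changed:
        changed = False
        for u, v in list(sim):
            for r, w, u2 in src_roles:
                if w == u and not any((r, v, v2) in tgt_roles and (u2, v2) in sim
                                      for v2 in tgt_objs):
                    sim.discard((u, v))
                    changed = True
                    break
    return sim


def has_simulation(source, target):
    """True iff some simulation from `source` to `target` exists."""
    sim = greatest_simulation(source, target)
    return all((a, a) in sim for a in individuals(source) | individuals(target))


loop = ({"x"}, set(), {("r", "x", "x")})   # a variable with an r-loop
fact = (set(), {("A", "a")}, set())        # the ABox {A(a)}
print(has_simulation(loop, fact))
```

Here a simulation from the loop qABox to {A(a)} exists, since the variable x need not be related to anything, although there is no homomorphism between them; this illustrates that IQ-entailment is weaker than CQ-entailment.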

To extend these results to the case of a non-empty TBox, we again need 
to saturate the ABox w.r.t. the TBox. But now the saturation rules, given in 
Fig. 2, are more parsimonious w.r.t. the introduction of new objects. To be more 
precise, for each existential restriction $\exists r.C \in \mathsf{Sub}(\mathcal{T})$, we assume that $x_C$ is 
a fresh variable not contained in the initial qABox $\exists X.\mathcal{A}$. When applying the 
$\exists$-rule to an assertion of the form $(\exists r.C)(t)$, we always use this variable for 
the successor object. Due to this restriction, IQ-saturation always terminates, 
i.e., it is not necessary to impose any restrictions on the TBox. Also note that 
IQ-saturation basically generates a qABox representation of what is called the 
canonical model in [25, Section 5.2]. 
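
The effect of reusing one variable per filler can be seen in a small sketch of IQ-saturation. It is written under assumptions that are not from the paper: concepts are frozensets of atoms (a string for a concept name, a pair (role, filler) for an existential restriction), a qABox is a triple of sets, and the variable $x_C$ is encoded as the tuple ("x", C). The helper add decomposes a right-hand side immediately, which plays the part of the two higher-priority rules.

```python
# Hypothetical encoding (not from the paper): concepts are frozensets of atoms,
# an atom is a concept name (str) or a pair (role, filler); a qABox is a triple
# (variables, concept_assertions, role_assertions); a TBox is a list of
# (lhs, rhs) pairs of concepts.

def entails(concepts, roles, concept, t):
    """Check whether the matrix entails concept(t) (no TBox involved)."""
    for atom in concept:
        if isinstance(atom, str):
            if (atom, t) not in concepts:
                return False
        else:
            r, filler = atom
            if not any(entails(concepts, roles, filler, v)
                       for s, u, v in roles if s == r and u == t):
                return False
    return True


def iq_saturate(qabox, tbox):
    variables, concepts, roles = (set(part) for part in qabox)

    def add(concept, t):
        # Decompose concept(t) at once: conjunctions are split and each
        # existential restriction with filler C uses the fixed variable x_C.
        for atom in concept:
            if isinstance(atom, str):
                concepts.add((atom, t))
            else:
                r, filler = atom
                x = ("x", filler)
                variables.add(x)
                roles.add((r, t, x))
                add(filler, x)

    changed = True
    while changed:
        changed = False
        objs = (set(variables) | {u for _, u in concepts}
                | {o for _, u, v in roles for o in (u, v)})
        for t in objs:
            for lhs, rhs in tbox:
                if entails(concepts, roles, lhs, t) and not entails(concepts, roles, rhs, t):
                    add(rhs, t)
                    changed = True
    return variables, concepts, roles


# The TBox {A ⊑ ∃r.A} applied to the ABox {A(a)}.
A = frozenset({"A"})
tbox = [(A, frozenset({("r", A)}))]
variables, concepts, roles = iq_saturate((set(), {("A", "a")}, set()), tbox)
print(len(variables), len(concepts), len(roles))
```

For the TBox and qABox on which the CQ-saturation rules do not terminate, this sketch stops after introducing the single variable $x_A$ together with an $r$-loop on it.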


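Theorem 3. Let $\mathcal{T}$ be an $\mathcal{EL}$ TBox and $\exists X.\mathcal{A}$ a qABox. Then exhaustive 
application of the IQ-saturation rules terminates in polynomial time in the size of 
$\exists X.\mathcal{A}$ and $\mathcal{T}$, and yields a qABox $\mathsf{sat}^{\mathcal{T}}_{\mathsf{IQ}}(\exists X.\mathcal{A})$ such that the following 
statements are equivalent for all qABoxes $\exists Y.\mathcal{B}$: 


— $\exists X.\mathcal{A} \models^{\mathcal{T}}_{\mathsf{IQ}} \exists Y.\mathcal{B}$, 
— $\mathsf{sat}^{\mathcal{T}}_{\mathsf{IQ}}(\exists X.\mathcal{A}) \models^{\emptyset}_{\mathsf{IQ}} \exists Y.\mathcal{B}$, 
— there is a simulation from $\exists Y.\mathcal{B}$ to $\mathsf{sat}^{\mathcal{T}}_{\mathsf{IQ}}(\exists X.\mathcal{A})$. 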


Since $\mathsf{sat}^{\mathcal{T}}_{\mathsf{IQ}}(\exists X.\mathcal{A})$ can be computed in polynomial time and the existence of a 
simulation can be decided in polynomial time, this shows that the entailment 
relation $\models^{\mathcal{T}}_{\mathsf{IQ}}$ can be decided in polynomial time. 
4 Canonical Repairs 


We specify what is to be repaired by a finite set of EL concept assertions, which 
we call a repair request. A repair is a qABox that does not have any of these 
assertions as a consequence. This generalizes previous repair approaches [6] in 
that more than one consequence specified as unwanted is removed in one step. 
It also encompasses the notion of a privacy policy, as introduced in [7], which 
specifies forbidden concepts, with the meaning that one should not be able to 
derive that any of the individuals occurring in the qABox is an instance of such 
a concept. We assume that the TBox is static (i.e., may not be changed by the 
repair) and consider both CQ- and IQ-entailment for comparing qABoxes. 


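Definition 4. Let $\mathcal{T}$ be an $\mathcal{EL}$ TBox and $\mathsf{QL} \in \{\mathsf{CQ}, \mathsf{IQ}\}$. 


— An $\mathcal{EL}$ repair request is a finite set of $\mathcal{EL}$ concept assertions. 

— Given a qABox $\exists X.\mathcal{A}$ and an $\mathcal{EL}$ repair request $\mathcal{R}$, a QL-repair of $\exists X.\mathcal{A}$ for 
$\mathcal{R}$ w.r.t. $\mathcal{T}$ is a qABox $\exists Y.\mathcal{B}$ such that $\exists X.\mathcal{A} \models^{\mathcal{T}}_{\mathsf{QL}} \exists Y.\mathcal{B}$ and $\exists Y.\mathcal{B} \not\models^{\mathcal{T}} C(a)$ 
for all $C(a) \in \mathcal{R}$. 

— Such a repair $\exists Y.\mathcal{B}$ is optimal if there is no QL-repair $\exists Z.\mathcal{C}$ of $\exists X.\mathcal{A}$ for 
$\mathcal{R}$ w.r.t. $\mathcal{T}$ such that $\exists Z.\mathcal{C} \models^{\mathcal{T}}_{\mathsf{QL}} \exists Y.\mathcal{B}$ and $\exists Z.\mathcal{C} \not\equiv^{\mathcal{T}}_{\mathsf{QL}} \exists Y.\mathcal{B}$. 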


Intuitively, a repair is a qABox that has no new consequences of the specified 
type (instance relationships or answers to conjunctive queries), and no longer 
has the consequences forbidden by the repair request. In an optimal repair, a 
minimal amount of consequences of the specified type is lost. Since there are 
different options for what to change when repairing a qABox, there may exist 
several non-equivalent optimal repairs. 

In the following, let $\mathsf{QL} \in \{\mathsf{CQ}, \mathsf{IQ}\}$ and let $\mathcal{T}$ be a fixed TBox, which is 
assumed to be cycle-restricted if QL = CQ. In addition, let $\mathcal{R}$ be a repair request 
and $\exists X.\mathcal{A}$ be the qABox to be QL-repaired for $\mathcal{R}$ w.r.t. $\mathcal{T}$. We assume that 
$\mathcal{R}$ does not contain an assertion of the form $C(a)$ such that $\top \sqsubseteq^{\mathcal{T}} C$ since 
the presence of such an assertion would preclude the existence of a repair. If $\mathcal{R}$ 
satisfies this restriction, then the empty qABox $\exists\emptyset.\emptyset$ is always a repair. However, 
as mentioned in the introduction, this does not imply that there is an optimal 
repair. We will show that, for the case of IQ-entailment, optimal repairs always 
exist. For CQ-entailment, this is the case if the TBox $\mathcal{T}$ is cycle-restricted. In 
both cases, the set of optimal repairs covers all repairs in the sense that each 
repair is entailed by some optimal repair. 

As mentioned in the introduction, to deal with TBoxes, the approach for 
computing so-called canonical repairs from [7] needs to be adapted in two ways. 
First, one needs to QL-saturate the given qABox w.r.t. the TBox. Second, when 
computing canonical repairs from $\mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A})$, the construction needs to ensure 
that the TBox does not reintroduce consequences that have been removed by 
the repair. The main idea underlying the construction of canonical repairs is to 
introduce variables as copies of the objects occurring in $\mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A})$. Such a 
variable is of the form $y_{u,\mathcal{K}}$, where the first component of the subscript says that 
this is a copy of the object $u$. The second component $\mathcal{K}$ is a set of atoms, with 
the intuitive meaning that $y_{u,\mathcal{K}}$ must not be an instance of any element of $\mathcal{K}$. To 
avoid introducing unnecessary copies, certain restrictions were imposed in [7] on 
the sets $\mathcal{K}$. We add a further restriction that takes care of the TBox. 

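To be more precise, let $\mathsf{Sub}(\mathcal{R}, \mathcal{T})$ be the set of subconcepts of concept 
descriptions occurring in $\mathcal{R}$ or $\mathcal{T}$, and let $\mathsf{Atoms}(\mathcal{R}, \mathcal{T})$ be the set of atoms occurring 
in $\mathsf{Sub}(\mathcal{R}, \mathcal{T})$. The set $\mathcal{K}$ in a variable $y_{u,\mathcal{K}}$ must be a repair type for $u$. 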


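Definition 5. Let $\exists Y.\mathcal{B} := \mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A})$ and let $u$ be an object name occurring 
in $\mathcal{B}$. A repair type for $u$ is a subset $\mathcal{K}$ of $\mathsf{Atoms}(\mathcal{R}, \mathcal{T})$ that satisfies the following: 


1. $\mathcal{B} \models C(u)$ for each atom $C \in \mathcal{K}$, 

2. if $C, D$ are distinct atoms in $\mathcal{K}$, then $C \not\sqsubseteq^{\emptyset} D$, 

3. $\mathcal{K}$ is premise-saturated w.r.t. $\mathcal{T}$, i.e., for all $C \in \mathsf{Sub}(\mathcal{R}, \mathcal{T})$ with $\mathcal{B} \models C(u)$ 
and $C \sqsubseteq^{\mathcal{T}} D$ for some $D \in \mathcal{K}$, there is $E \in \mathcal{K}$ such that $C \sqsubseteq^{\emptyset} E$. 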


The first two conditions coincide with the ones in [7]. Basically, Condition 1 says that 
we only need to remove instance relationships explicitly if they are really there. 
Condition 2 corresponds to the fact that preventing $D(y_{u,\mathcal{K}})$ as a consequence 
also prevents $C(y_{u,\mathcal{K}})$ if $D$ subsumes $C$, and thus $C \in \mathcal{K}$ would be redundant if 
$D \in \mathcal{K}$. Condition 3 ensures that instance relationships that are removed due to 
$\mathcal{K}$ cannot be re-introduced by the TBox. It is easy to see that the set of repair 
types for $u$ can be computed in exponential time. 

Similarly to the approach in [7], canonical repairs are induced by seed func- 
tions. Such a function determines, for each individual, which instance relation- 
ships should be prevented in order to obtain a repair. 


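Definition 6. A repair seed function is a function $s$ that maps each individual 
name $b \in \Sigma_I(\exists X.\mathcal{A})$ to a repair type $s(b)$ for $b$ that satisfies the following: 


— if $C(b) \in \mathcal{R}$ and $\mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}) \models C(b)$, then $s(b)$ contains an atom $D$ such 
that $C \sqsubseteq^{\emptyset} D$. 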


Using our general assumption that the repair request $\mathcal{R}$ does not contain a 
concept assertion $C(a)$ with $\top \sqsubseteq^{\mathcal{T}} C$, we can show that there is always at least 
one repair seed function. Each repair seed function induces a repair as follows. 


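Definition 7. Given a repair seed function $s$, we define the canonical QL-repair 
$\mathsf{rep}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, s)$ induced by $s$ as the qABox $\exists Y.\mathcal{B}$ where 


1. the set $Y$ consists of the variables $y_{u,\mathcal{K}}$ for all object names $u$ occurring in 
$\mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A})$ and all repair types $\mathcal{K}$ for $u$, except for the case where $u$ is an 
individual name and $\mathcal{K} = s(u)$, and 

2. the matrix $\mathcal{B}$ consists of the following assertions, where we use $y_{b,s(b)}$ as a 
synonym for the individual name $b$: 

— $A(y_{u,\mathcal{K}}) \in \mathcal{B}$ for each concept assertion $A(u)$ in $\mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A})$ such 
that $A \notin \mathcal{K}$, 

— $r(y_{u,\mathcal{K}}, y_{v,\mathcal{L}}) \in \mathcal{B}$ for each role assertion $r(u,v)$ in $\mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A})$ such 
that the following holds for each $\exists r.C \in \mathcal{K}$: if the matrix of $\mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A})$ 
entails $C(v)$, then the set $\mathcal{L}$ contains an atom that subsumes $C$. 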
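Our construction of canonical repairs based on seed functions is sound and 
complete in the following sense. 


Proposition 8. For each repair seed function $s$, the induced canonical repair 
$\mathsf{rep}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, s)$ is a QL-repair of $\exists X.\mathcal{A}$ for $\mathcal{R}$ w.r.t. $\mathcal{T}$. Conversely, if $\exists Y.\mathcal{B}$ is 
a QL-repair of $\exists X.\mathcal{A}$ for $\mathcal{R}$ w.r.t. $\mathcal{T}$, then there is a repair seed function $s$ such 
that $\mathsf{rep}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, s) \models^{\mathcal{T}}_{\mathsf{QL}} \exists Y.\mathcal{B}$. 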


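We define the set of all canonical QL-repairs of $\exists X.\mathcal{A}$ for $\mathcal{R}$ w.r.t. $\mathcal{T}$ as 
$\mathsf{Repairs}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, \mathcal{R}) := \{\, \mathsf{rep}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, s) \mid s \text{ is a repair seed function} \,\}$. 


As an easy consequence of Proposition 8 we obtain that $\mathsf{Repairs}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, \mathcal{R})$ 
contains all optimal repairs (up to equivalence). However, as in the case without 
a TBox, it may also contain non-optimal repairs [7]. To compute the set 
of optimal repairs, one thus needs to remove such non-optimal elements from 
$\mathsf{Repairs}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, \mathcal{R})$. Since the entailment test required for this is NP-complete 
for QL = CQ and polynomial for QL = IQ, we obtain the following theorem. 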


Theorem 9. There is a (deterministic) algorithm that computes the set of all 
optimal QL-repairs of $\exists X.\mathcal{A}$ for $\mathcal{R}$ w.r.t. $\mathcal{T}$ and runs in exponential time. If 
QL = CQ, then this algorithm needs access to an NP oracle, whereas no such 
oracle is required for QL = IQ. 


5 Optimized Repairs 


The construction of the canonical repair induced by a seed function described in 
the previous section usually introduces an exponential number of copies for the 
objects occurring in the saturated qABox. The following example demonstrates 
that this is not always necessary to obtain an optimal repair. 


Example 10. Let $\mathcal{T} := \emptyset$ and consider the repair request $\{(\exists r.(A_1 \sqcap \dots \sqcap A_n))(a)\}$ 
for the qABox $\exists\{x\}.\{r(a,x), A_1(x), \dots, A_n(x)\}$. There is only one repair seed 
function $s$, which assigns $\{\exists r.(A_1 \sqcap \dots \sqcap A_n)\}$ to $a$. Both for the CQ and the 
IQ case, the canonical repair induced by $s$ contains $2^n$ copies of $x$, namely all 
the variables $y_{x,\mathcal{K}}$ for $\mathcal{K} \subseteq \{A_1, \dots, A_n\}$. However, most of these copies are 
redundant. In fact, we will see below that there are optimal repairs equivalent 
to the canonical one that contain only linearly many variables in $n$, both for the 
CQ and the IQ case. 


The idea is now to construct, for a given seed function, a set of variables that 
is a (hopefully small) subset of the set $Y$ introduced in Definition 7, which is 
nevertheless sufficient to obtain a repair equivalent to the canonical one. Note, 
however, that in general an exponential blow-up cannot be avoided, as already 
shown in [5] for the case of $\mathcal{EL}$ instance stores. Throughout this section, we 
assume that QL, $\mathcal{T}$, $\mathcal{R}$, and $\exists X.\mathcal{A}$ satisfy the properties assumed in the previous 
section. In addition, we assume that the repair request $\mathcal{R}$ is reduced, i.e., every 
concept occurring in a concept assertion in $\mathcal{R}$ is reduced, and if $\mathcal{R}$ contains $C(a)$ 
and $D(a)$ for distinct concept descriptions $C, D$, then $C \not\sqsubseteq^{\emptyset} D$, and we further 
assume that each concept occurring in the TBox $\mathcal{T}$ is reduced. Before we can 
describe our construction of the set of relevant variables, we must introduce some 
notation and show an auxiliary result. 

Given two sets of concept descriptions $\mathcal{K}$ and $\mathcal{L}$, we say that $\mathcal{L}$ covers $\mathcal{K}$ 
(written $\mathcal{K} \leq \mathcal{L}$) if each concept in $\mathcal{K}$ is subsumed by some concept in $\mathcal{L}$. 
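
The cover relation can be tested directly once subsumption w.r.t. the empty TBox is available. The sketch below uses the well-known structural characterization of subsumption in $\mathcal{EL}$ without a TBox; the encoding of concepts as frozensets of atoms (strings for concept names, pairs (role, filler) for existential restrictions) and the function names are assumptions for illustration only.

```python
# Hypothetical encoding (not from the paper): a concept description is a
# frozenset of atoms; an atom is a concept name (str) or a pair (role, filler).

def subsumed_by(c, d):
    """Structural test for 'C is subsumed by D' w.r.t. the empty TBox."""
    for atom in d:
        if isinstance(atom, str):
            # every concept name of D must occur as a top-level atom of C
            if atom not in c:
                return False
        else:
            # every restriction in D needs a more specific one in C
            role, d_filler = atom
            if not any(isinstance(a, tuple) and a[0] == role
                       and subsumed_by(a[1], d_filler) for a in c):
                return False
    return True


def covers(l_set, k_set):
    """True iff each concept in k_set is subsumed by some concept in l_set."""
    return all(any(subsumed_by(c, d) for d in l_set) for c in k_set)


# The set {A_1 ⊓ A_2 ⊓ A_3}, i.e., Succ(s(a), r, x) from Example 10 for n = 3.
succ = {frozenset({"A1", "A2", "A3"})}
print(covers({frozenset({"A2"})}, succ), covers(set(), succ))
```

For the set taken from Example 10, each singleton $\{A_i\}$ is a cover, whereas the empty set is not.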

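Now, let $s$ be a repair seed function and set $\exists Y.\mathcal{B} := \mathsf{rep}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, s)$. 
Recall that, according to Definition 7, a role assertion $r(y_{t,\mathcal{K}}, y_{u,\mathcal{L}})$ belongs 
to the matrix $\mathcal{B}$ iff the saturation $\mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A})$ contains the role assertion 
$r(t,u)$ and the repair type $\mathcal{L}$ covers the set $\mathsf{Succ}(\mathcal{K},r,u) := \{\, C \mid \exists r.C \in 
\mathcal{K} \text{ and the matrix of } \mathsf{sat}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}) \text{ entails } C(u) \,\}$. 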

If $\mathcal{L}$ does not satisfy this requirement, there might be another repair type 
$\mathcal{L}'$ such that the canonical repair contains the assertion $r(y_{t,\mathcal{K}}, y_{u,\mathcal{L}'})$, and thus 
our optimized repair needs to contain an appropriate variable to which $y_{u,\mathcal{L}'}$ can 
be mapped by a homomorphism or simulation. We generate such variables by 
looking for repair types $\mathcal{M}$ that cover both $\mathcal{L}$ and $\mathsf{Succ}(\mathcal{K},r,u)$. The set of all 
such repair types can effectively be computed, though it might be empty. For 
our purposes, it is sufficient to use only the ones that are minimal w.r.t. the 
cover relation $\leq$. 


Lemma 11. The set of all $\leq$-minimal repair types for $u$ that cover $\mathcal{L} \cup 
\mathsf{Succ}(\mathcal{K},r,u)$ can be computed in exponential time. 


In general, this computation may produce exponentially many repair types, but 
this is not always the case. For instance, consider $a = y_{a,s(a)}$ and $y_{x,\emptyset}$ in 
Example 10. We have $\mathsf{Succ}(s(a),r,x) = \{A_1 \sqcap \dots \sqcap A_n\}$ and thus the assertion $r(a, y_{x,\emptyset})$ 
is not in $\mathcal{B}$ since $\emptyset$ clearly does not cover $\mathsf{Succ}(s(a),r,x)$. The $\leq$-minimal repair 
types covering $\mathsf{Succ}(s(a),r,x)$ are exactly the sets $\{A_i\}$ for $i = 1, \dots, n$. 

In the following, we construct a sequence $Y_0, Y_1, \dots, Y_m$ of subsets $Y_i$ of $Y$ 
such that $\exists Y.\mathcal{B}$ is QL-equivalent to its sub-qABox $\exists Y_m.\mathcal{B}_m$ where $\mathcal{B}_m$ contains 
only those assertions in $\mathcal{B}$ involving object names in $\Sigma_I \cup Y_m$. Recall that we use 
$y_{a,s(a)}$ as synonyms for the individuals $a \in \Sigma_I$. 

We start with the set $Y_0$, which is empty if QL = IQ, and equal to the set 
$\{\, y_{t,\emptyset} \mid t \text{ is an object name occurring in } \mathsf{sat}^{\mathcal{T}}_{\mathsf{CQ}}(\exists X.\mathcal{A}) \,\}$ if QL = CQ. 

The subsequent sets are obtained by exhaustively applying one of the follow- 
ing rules, depending on whether QL = CQ or QL = IQ. 


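CQ-construction rule. If $y_{t,\mathcal{K}}$ and $y_{u,\mathcal{L}}$ are elements of $\Sigma_I \cup Y_i$, the 
saturation $\mathsf{sat}^{\mathcal{T}}_{\mathsf{CQ}}(\exists X.\mathcal{A})$ contains the role assertion $r(t,u)$, the repair type $\mathcal{L}$ 
does not cover $\mathsf{Succ}(\mathcal{K},r,u)$, and $\mathcal{M}$ is a $\leq$-minimal repair type for $u$ that 
covers $\mathcal{L} \cup \mathsf{Succ}(\mathcal{K},r,u)$, but $y_{u,\mathcal{M}}$ is not contained in $\Sigma_I \cup Y_i$, then set 
$Y_{i+1} := Y_i \cup \{y_{u,\mathcal{M}}\}$. 


IQ-construction rule. If $y_{t,\mathcal{K}}$ is an element of $\Sigma_I \cup Y_i$, the saturation 
$\mathsf{sat}^{\mathcal{T}}_{\mathsf{IQ}}(\exists X.\mathcal{A})$ contains the role assertion $r(t,u)$, and $\mathcal{M}$ is a $\leq$-minimal 
repair type for $u$ that covers $\mathsf{Succ}(\mathcal{K},r,u)$, but $y_{u,\mathcal{M}}$ is not contained in 
$\Sigma_I \cup Y_i$, then set $Y_{i+1} := Y_i \cup \{y_{u,\mathcal{M}}\}$. 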
The sets $Y_i$ are all subsets of the set $Y$ of variables in the canonical repair. Since 
each rule application adds a variable, the exhaustive application of rules must 
terminate after finitely many steps with a set of variables $Y_m \subseteq Y$. 

Let us illustrate this construction using Example 10, first for the IQ case. 
We have $a = y_{a,s(a)} \in \Sigma_I$ and the assertion $r(a,x)$ belongs to the saturation, 
which is equal to the original qABox. As mentioned above, the $\leq$-minimal repair 
types covering $\mathsf{Succ}(s(a),r,x)$ are exactly the sets $\{A_i\}$ for $i = 1, \dots, n$. Thus, 
repeated applications of the IQ-construction rule add the variables $y_{x,\{A_i\}}$, and 
the construction ends with $Y_m^{\mathsf{IQ}} = \{\, y_{x,\{A_i\}} \mid i = 1, \dots, n \,\}$. In the CQ case, the 
initial set of variables is $Y_0 = \{y_{a,\emptyset}, y_{x,\emptyset}\}$. In this example, the CQ-construction 
rule then generates the same variables as the IQ rule, though this need not be 
the case in general. We end up with the final set $Y_0 \cup Y_m^{\mathsf{IQ}}$. 


Definition 12. Let $s$ be a repair seed function and $Y_m \subseteq Y$ be the set of 
variables obtained by an exhaustive application of the QL-construction rule. 
The optimized QL-repair of $\exists X.\mathcal{A}$ for $\mathcal{R}$ w.r.t. $\mathcal{T}$ induced by $s$, denoted by 
$\mathsf{orep}^{\mathcal{T}}_{\mathsf{QL}}(\exists X.\mathcal{A}, s)$, is the qABox $\exists Y_m.\mathcal{B}_m$ where the matrix $\mathcal{B}_m$ contains all 
assertions in $\mathcal{B}$ involving only object names in $\Sigma_I \cup Y_m$. 


Note that, to compute $\mathcal{B}_m$, we need not compute the larger matrix $\mathcal{B}$ first. 
Instead, we just apply the definition of the matrix in Definition 7 to the object 
names in $\Sigma_I \cup Y_m$. 

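In our example, the optimized IQ-repair is the qABox $\exists Y_m^{\mathsf{IQ}}.\mathcal{B}_m$ with 


$\mathcal{B}_m = \{\, r(a, y_{x,\{A_i\}}) \mid 1 \leq i \leq n \,\} \cup \{\, A_j(y_{x,\{A_i\}}) \mid i \neq j \text{ and } 1 \leq i, j \leq n \,\}$. 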


In the optimized CQ-repair, the quantifier prefix additionally contains the 
variables $y_{a,\emptyset}$ and $y_{x,\emptyset}$, and the matrix additionally contains the assertions 
$r(y_{a,\emptyset}, y_{x,\emptyset})$ and $A_i(y_{x,\emptyset})$ for $i = 1, \dots, n$. Note that, without these assertions, 
the positive answer to the Boolean conjunctive query $\exists y, z.(r(y,z) \wedge A_1(z) \wedge 
\dots \wedge A_n(z))$ would be lost. 

Coming back to the general case, we first observe that the canonical QL- 
repair induced by s QL-entails the optimized QL-repair induced by s due to the 
inclusion relationship between these two qABoxes. The entailment in the other 
direction also holds, but this is harder to show, in particular for QL = CQ. 


Proposition 13. For each repair seed function s, the optimized QL-repair in- 
duced by s QL-entails the canonical QL-repair induced by s. 


Proof sketch. For QL = IQ, the proposition can be proved by showing that the 
following relation $\mathfrak{S}$ is a simulation from $\exists Y.\mathcal{B}$ to $\exists Y_m.\mathcal{B}_m$: 


$\mathfrak{S} := \{\, (y_{t,\mathcal{K}}, y_{t,\mathcal{K}'}) \mid y_{t,\mathcal{K}} \in \Sigma_O(\exists Y.\mathcal{B}),\ y_{t,\mathcal{K}'} \in \Sigma_O(\exists Y_m.\mathcal{B}_m), \text{ and } \mathcal{K}' \leq \mathcal{K} \,\}$. 


For QL = CQ, we introduce a sequence of mappings $h_0, h_1, \dots, h_n\colon \Sigma_O(\exists Y.\mathcal{B}) \to 
\Sigma_O(\exists Y_m.\mathcal{B}_m)$, starting with $h_0(y_{t,\mathcal{K}}) := y_{t,s(t)}$ if $t \in \Sigma_I$ and $s(t) \leq \mathcal{K}$ and 
$h_0(y_{t,\mathcal{K}}) := y_{t,\emptyset}$ otherwise. The initial mapping $h_0$ need not be a homomorphism 
since role assertions may not be preserved. In the step-wise construction of the 
mappings $h_i$ such defects are corrected, one by one. We can show that this 
construction always terminates after finitely many steps, yielding a homomorphism 
$h_n$ from $\exists Y.\mathcal{B}$ to $\exists Y_m.\mathcal{B}_m$. 


Summing up, we have thus shown the following theorem, which implies that 
the optimized repairs also satisfy the properties stated in Proposition 8. 


Theorem 14. For each repair seed function s, the canonical QL-repair induced 
by s and the optimized QL-repair induced by s are QL-equivalent. 


6 Evaluation 


To find out whether the repair approaches introduced in this paper are in 
principle viable for non-trivial ontologies, we ran experiments for both IQ- and 
CQ-repairs with a first, rather unoptimized implementation. In addition to checking 
how often the implementation was able to compute a repair within a certain 
timeout, we also compared the sizes of optimized repairs with those of canonical 
repairs. We considered two different repair scenarios: repairing a single unwanted 
consequence for a single individual (S1), and repairing a single unwanted conse- 
quence for 10% of the individuals occurring in the ABox (S2). We report here 
the main results—more details and discussions can be found in [4]. 

As corpus for our evaluation, we chose the ontologies used in the 2015 OWL 
Reasoner Competition for the track OWL EL Realisation [28], since they contain 
a substantial amount of ABox assertions. These 109 ontologies were converted 
into pure EL by applying standard transformations and afterwards filtering out 
unsupported axioms. From these ontologies, we kept those that had at most 
100,000 axioms in total. The resulting corpus contained 80 ontologies. 

We implemented our methods in Java, using the OWL-API¹ for parsing 
OWL ontologies, and ELK [22] for precomputing any subsumption relationships 
entailed with and without the TBox potentially relevant for our repair approach. 
The code is available online.² All experiments were performed on an Intel(R) 
Core(TM) i5-4590 CPU with 4 cores and 32 GB RAM, of which we assigned 16 
GB as maximal heap space to the Java VM. 

Since it is a precondition of our repair approach, we first saturated the on- 
tologies using the IQ-saturation rules of Figure 2, and the CQ-saturation rules 
of Figure 1. The CQ-saturation rules were implemented using the rule engine 
VLog [11] through the Java facade Rulewerk.³ As CQ-saturation only terminates 
for cycle-restricted TBoxes, we only considered those ontologies for the 
CQ-saturation whose IQ-saturation did not introduce cycles between introduced 
variables. We used a timeout of 60 minutes for every saturation. This way, we 
successfully computed IQ-saturations of every ontology, and 62 CQ-saturations. 


¹ http://owlapi.sourceforge.net 
² https://github.com/de-tu-dresden-inf-lat/abox-repairs-wrt-static-tbox 
³ https://github.com/knowsys/rulewerk 
The size of the saturated ABox was usually not much larger than that of the 
original one, and always less than two orders of magnitude larger. Interestingly, 
the successful CQ-saturations were rarely larger than the IQ-saturations, and 
often even of the same size, because no variables were added. 

Scenario S1 was about repairing a single faulty entailment $\mathcal{A} \models^{\mathcal{T}} C(a)$. Since 
we did not have information about whether any entailments from the considered 
ontologies are faulty, we generated such assertions randomly. For this, we looked 
at entailments of the form $\mathcal{A} \models^{\mathcal{T}} C(a)$, where $C \in \mathsf{Sub}(\mathcal{T})$. To make the repair 
requests more interesting, we furthermore required that $C$ is not of the form 
$A$ or $\exists r.\top$, where $A$ is a concept name. This requirement already ruled out 54 
of the IQ-saturated ontologies, and 44 of the CQ-saturated ontologies, as they 
did not have any complex entailments of the required form. For Scenario S2, we 
randomly selected some concept $C \in \mathsf{Sub}(\mathcal{T})$ which had at least one instance 
(surprisingly, although $C$ was not required to be complex, this ruled out 12 
ontologies, including 4 of the CQ-saturated ones), together with a random selection 
of 10% of the individuals in $\mathcal{A}$, and built the repair request consisting of all 
assertions $C(a)$ where $a$ ranges over the selected individuals. For both scenarios, 
we selected a random seed function for the obtained repair request. 

For each ontology, scenario, and QL € {IQ, CQ}, we attempted to compute 
optimised QL-repairs for 50 different repair requests. We also tried to compute 
the set of objects that would be included in the canonical repairs, to get an idea 
of the impact of our optimisation. For each such repair computation, we used a 
timeout of 10 minutes. Since all repair requests used only concept descriptions 
that were already in the input ontology, the number of objects in the canoni- 
cal repair was independent of the repair request. We thus performed the latter 
computation only once for each ontology. The success rates were as follows: 


— The objects included in the canonical IQ- and CQ-repair could be computed 
within the timeout and without memory exceptions for respectively only 
52.9 % and 62.1 % of the ontologies. 

— For S1, we could compute the optimized IQ-repair in 99.9 %, and the opti- 
mised CQ-repair in 100.0 % of all attempts. 

— For S2, 98.9 % of IQ-repairs and 99.9 % of CQ-repairs were successful. 


This shows that the optimizations introduced in Section 5 have a very positive 
impact on the viability of our repair approach. 

Fig. 3 gives more information on the number of objects and assertions in the 
computed repairs. On the left, we consider canonical and optimised IQ-repairs 
for scenario S2: specifically, we look at the difference in numbers of individuals 
occurring in the repair compared to the input ABox. In the middle and on the 
right, we visualise the difference between the number of assertions in the opti- 
mized IQ- and CQ-repairs, compared to the input ABoxes, for the scenarios S1 
and S2, respectively. By construction, CQ-repairs cannot contain fewer assertions 
than the input ontologies. Sometimes the CQ-repairs were smaller than the cor- 
responding IQ-repairs, which is due to the different saturation methods: variables 
introduced by the IQ-saturation could be connected to more individuals than for 
the CQ-saturation. 
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Fig. 3: Evaluation results. On the left, we show the difference of the number of ob- 
ject names in the canonical IQ-repairs (purple triangle) with the same difference, 
but restricted to objects occurring in assertions, for the optimised IQ-repairs (red 
circle) for S2. The other two graphs consider optimised IQ- and CQ-repairs for 
S1 and S2. In each graph, the x-axis shows the number of assertions in the input 
ontology, and the y-axis the observed difference. 


7 Conclusion 


This paper presents approaches for repairing DL-based ontologies, in the sense 
that they make it possible to get rid of unwanted consequences. In contrast to most of the 
other work on ontology repair, our goal is to compute optimal repairs, i.e., ones 
that lose the least amount of other consequences. As relevant consequences to 
be preserved, we consider both answers to conjunctive queries (CQ) and answers 
to EL instance queries (IQ). The presented results improve on our previous work 
in this direction in two respects. First, we allow for the presence of a TBox, 
which is assumed to be static (i.e., cannot be changed by the repair), whereas 
before we assumed that the TBox is empty. Second, we develop a more efficient 
construction of optimal repairs, which is exponential only in the worst case. Our 
experimental results show that this optimization makes our repair approach 
viable also for fairly large ontologies, at least for the IQ case. 

One question for future research is how to lift the restriction to 
cycle-restricted TBoxes in the CQ case. Since optimal repairs need no longer 
exist then, one can ask whether the existence question is decidable, and how to 
compute optimal repairs if they exist. We have already noticed in our first 
attempts to tackle this problem that optimal repairs may then become larger than 
single-exponential. 

In this and in our previous work, we have assumed that unwanted 
consequences are specified as EL instance relationships. Another interesting open 
question is whether our results can be generalized to a setting where unwanted 
consequences are specified as answers to conjunctive queries, as e.g. in [14].⁴ 


⁴ Note that no TBox is considered in [14], and the notion of optimality used there is 
different from ours (see the introduction of [7] for a discussion of the differences). 


Abstract. We prove the SOS strategy for first-order resolution to be 
refutationally complete on a clause set N and set-of-support S if and only 
if there exists a clause in S that occurs in a resolution refutation from N ∪ S. 
This strictly generalizes and sharpens the original completeness result 
requiring N to be satisfiable. The generalized SOS completeness result 
supports automated reasoning on a new notion of relevance aiming at 
capturing the support of a clause in the refutation of a clause set. A clause 
C is relevant for refuting a clause set N if C occurs in every refutation of 
N. The clause C is semi-relevant, if it occurs in some refutation, i.e., if 
there exists an SOS refutation with set-of-support S = {C} from N\{C}. 
A clause that does not occur in any refutation from N is irrelevant, i.e., it is 
not semi-relevant. Our new notion of relevance separates clauses in a proof 
that are ultimately needed from clauses that may be replaced by different 
clauses. In this way it provides insights towards proof explanation in 
refutations beyond existing notions such as that of an unsatisfiable core. 


1 Introduction 


Shortly after the invention of first-order resolution [14] its first complete refine- 
ment was established: set-of-support (SOS) resolution [18]. The idea of the SOS 
strategy is to split a current clause set into two sets, namely N and S and re- 
strict resolution inferences to have one parent from the set-of-support S. Wos 
et al. [18] proved the SOS strategy complete if N is satisfiable. The motivation 
of Wos et al. for the SOS strategy was getting rid of “irrelevant” inferences. 
If N defines a theory and S contains the negation of a conjecture (goal) to be 
refuted, the strategy puts emphasis on resolution inferences with the conjecture. 
This can be beneficial, because resolution is deductively complete (modulo 
subsumption) [11,13], i.e., resolution inferences solely performed on clauses from N 
will enumerate all semantic consequences, not necessarily only consequences that 
turn out to be useful in refuting N ∪ S. Even in more restrictive contexts, the 
SOS strategy can be shown complete, e.g., if N is saturated by superposition 
and does not contain the empty clause, then the SOS strategy is also complete 
in the context of the strong superposition inference restrictions on N and a 
set-of-support S [2]. 


In this paper, we generalize and sharpen the original completeness result for 
the SOS strategy: The resolution calculus with the SOS strategy is complete if and 
only if there is at least one clause in S that is contained in a resolution refutation 
from N ∪ S (Theorem 11). The proof is performed via proof transformation. Any 
(non-SOS) refutation from N ∪ S can be transformed into an SOS refutation 
with SOS S, if the original refutation contains at least one clause from S. 


The generalized SOS completeness result supports our new notion of relevance 
that is meant to be a first step towards explaining the gist of a refutation. A 
clause C ∈ N is relevant if it is needed for any refutation of the clause set N. 
The clause C is semi-relevant if there is a refutation from N using C, and C is 
irrelevant otherwise (Definition 12). Applying our generalized SOS completeness 
result, a clause C ∈ N is semi-relevant if and only if there is an SOS refutation 
from N \ {C} with SOS {C}. 
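
For the propositional case, the characterisation of semi-relevance above can be tried out directly. The following is only a minimal sketch under our own assumptions, not the procedure of this paper: clauses are modelled as frozensets of non-zero integers (a negative integer is a negated atom), so factoring is implicit, no redundancy elimination is applied, and the names `resolvents`, `sos_refutable`, `semi_relevant` and `cl` are hypothetical helpers chosen for illustration.

```python
def resolvents(c, d):
    """All propositional resolvents of the clauses c and d."""
    return {(c - {lit}) | (d - {-lit}) for lit in c if -lit in d}


def sos_refutable(n, s):
    """Saturate (N, S) under resolution where one parent is always from S.

    Assumption of this sketch: no tautology or subsumption deletion, so the
    search stays within plain resolution.
    """
    n, sos = set(n), set(s)
    todo = list(sos)
    while todo:
        c = todo.pop()
        for d in list(n | sos):
            for r in resolvents(c, d):
                if r not in sos:  # derived clauses always enter the SOS
                    sos.add(r)
                    todo.append(r)
    return frozenset() in sos


def semi_relevant(clauses, c):
    """Test for an SOS refutation from N without C, with SOS {C}."""
    n = set(clauses)
    return sos_refutable(n - {c}, {c})


def cl(*lits):
    """Build a clause from integer literals."""
    return frozenset(lits)


# Hypothetical clause set: p, not p, q. The clause q occurs in no refutation.
N = [cl(1), cl(-1), cl(2)]
print([semi_relevant(N, c) for c in N])
```

Termination follows because only finitely many clauses exist over a finite set of atoms; the saturation is exponential in the worst case and is meant for small examples only.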


The interest in semi-relevant clauses comes from real-world applications. In 
an industrial scenario where different products are built out of a building set, 
the overall product portfolio is often defined by a set of clauses (rules). Roughly, 
every clause describes the integration of some part out of the building set in a 
product. Different proofs for the existence of some product correspond to different 
builds of the product. For example, answering a question like “Can we build car 
x with part y?” from the automotive world boils down to the semi-relevance 
of the clauses defining part y in a refutation showing the existence of a car x. 
All German car manufacturers maintain such clause sets defining their product 
portfolio [6,17]. 


Our new notion of relevance is related to other notions capturing aspects of 
a refutation. A minimal unsatisfiable core of an unsatisfiable clause set contains 
only semi-relevant clauses. The intersection of all minimal unsatisfiable cores is 
the set of relevant clauses. The notion of a minimal unsatisfiable core does not 
provide a test for semi-relevance of a specific clause. There are various notions 
from the description logic community related to unsatisfiable cores of a translation 
to first-order and/or to our notion of relevance [1,4,8,16]. An in-depth discussion 
of these relationships can be found in our description logic workshop paper [7]. 
The notion of relevant clauses is also related to what has been studied in the 
field of propositional satisfiability under the name of lean kernels [9,10]: Given 
an unsatisfiable set N of propositional clauses, the lean kernel consists exactly 
of those clauses that are involved in at least one refutation proof of N in the 
resolution calculus, and thus, in our terminology, the set of semi-relevant clauses. A 
different notion of relevance was previously defined in the context of propositional 
abduction [5]. The authors provide algorithms and complexity results for various 
abduction settings in the propositional logic context. In addition to the fact 
that our notion of relevance is defined with respect to first-order clauses, in their 
context of propositional abduction, if a propositional variable is relevant, it must 
be satisfiability preserving when added to the theory (clause set). In our case, if 
a clause C ∈ N is (semi-)relevant, then N is unsatisfiable and N \ {C} may be 
unsatisfiable as well. 

The paper is organized as follows. After fixing some notations and notions at 
the beginning of Section 2, we introduce our proof transformation technique, first 
on an example (Figure 1), then in general. The following Section 3 proves important 
properties of the transformation, yielding our generalized completeness result for 
SOS, Theorem 11. We then link the SOS completeness result to our notion of 
semi-relevance in Section 4. The paper ends with a summary, a discussion of the 
contributions, and directions for future work in Section 5. 


2 Resolution Proof Transformation 


After fixing some common notions and notation, this section introduces our proof 
transformation technique, first on an example and afterwards on resolution 
refutations in general. 

We assume a first-order language without equality where N denotes a clause 
set; C, D denote clauses; L, K denote literals; A, B denote atoms; P, Q, R, T 
denote predicates; t, s terms; f, g, h functions; a, b, c constants; and x, y, z 
variables, all possibly indexed. Atoms, literals, clauses and clause sets are 
considered as usual. Clauses are disjunctions of literals. The complement of a 
literal is denoted by the function comp. Semantic entailment ⊨ considers 
variables in clauses to be universally quantified. Substitutions σ, τ are total 
mappings from variables to terms, where dom(σ) := {x | xσ ≠ x} is finite and 
codom(σ) := {t | xσ = t, x ∈ dom(σ)}. A renaming σ is a bijective substitution. 
The application of substitutions is extended to literals, clauses, and sets/sequences 
of such objects in the usual way. The function mgu denotes the most general 
unifier of two terms, atoms, or literals if it exists. We assume that any mgu of two 
terms or literals does not introduce any fresh variables and is idempotent. 
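
The assumptions on mgus can be made concrete with a small unification routine. The following is a minimal sketch under our own encoding, not part of the paper: a variable is a Python string, any other term or atom is a tuple `(symbol, arg1, ..., argn)`, and a substitution is a dictionary from variables to terms. The names `is_var`, `apply`, `occurs` and `mgu` are hypothetical. The computed unifier only mentions variables of its inputs and is kept idempotent by rewriting earlier bindings whenever a new one is added.

```python
def is_var(t):
    """Variables are strings; compound terms and constants are tuples."""
    return isinstance(t, str)


def apply(t, sigma):
    """Apply the substitution sigma (dict: variable -> term) to t."""
    if is_var(t):
        return sigma.get(t, t)
    return (t[0],) + tuple(apply(a, sigma) for a in t[1:])


def occurs(v, t):
    """Occurs check: does the variable v occur in t?"""
    return v == t if is_var(t) else any(occurs(v, a) for a in t[1:])


def mgu(s, t):
    """Return an idempotent most general unifier of s and t, or None."""
    sigma = {}
    stack = [(s, t)]
    while stack:
        a, b = stack.pop()
        a, b = apply(a, sigma), apply(b, sigma)
        if a == b:
            continue
        if is_var(b) and not is_var(a):
            a, b = b, a
        if is_var(a):
            if occurs(a, b):
                return None
            # keep sigma idempotent: rewrite old bindings, then add a -> b
            sigma = {v: apply(u, {a: b}) for v, u in sigma.items()}
            sigma[a] = b
        elif a[0] != b[0] or len(a) != len(b):
            return None
        else:
            stack.extend(zip(a[1:], b[1:]))
    return sigma


# Unifying P(x4) with P(f(a)) binds x4 to f(a).
print(mgu(('P', 'x4'), ('P', ('f', ('a',)))))
```

Here dom(σ) corresponds to the dictionary keys and codom(σ) to its values.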

The resolution calculus consists of two inference rules: Resolution and Fac- 
toring [14,15]. The rules operate on a state (N, S) where the initial state for 
a classical resolution refutation from a clause set N is (Ø, N) and for an SOS 
refutation with clause set N and initial SOS S the initial state is (N, S). We 
describe the rules in the form of abstract rewrite rules operating on states (N, S). 
As usual we assume for the resolution rule that the involved clauses are variable 
disjoint. This can always be achieved by applying renamings to fresh variables. 


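Resolution   $(N, S \uplus \{C \lor K\}) \Rightarrow_{\mathrm{RES}} (N, S \cup \{C \lor K, (D \lor C)\sigma\})$ 
provided $(D \lor L) \in (N \cup S)$ and $\sigma = \mathrm{mgu}(L, \mathrm{comp}(K))$ 


Factoring   $(N, S \uplus \{C \lor L \lor K\}) \Rightarrow_{\mathrm{RES}} (N, S \cup \{C \lor L \lor K\} \cup \{(C \lor L)\sigma\})$ 
provided $\sigma = \mathrm{mgu}(L, K)$ 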


The clause $(D \lor C)\sigma$ is called the result of a Resolution inference between 
its parents. The clause $(C \lor L)\sigma$ is called the result of a Factoring inference 
of its parent. A sequence of rule applications $(N, S) \Rightarrow^*_{\mathrm{RES}} (N, S')$ is called a 
resolution derivation. It is called an SOS resolution derivation if $N \neq \emptyset$. In case 
$\bot \in S'$ it is called a (SOS) resolution refutation. 


Theorem 1 (Soundness and Refutational Completeness of (SOS) 
Resolution [14,18]). Resolution is sound and refutationally complete [14]. If for 
some clause set $N$ and initial SOS $S$, $N$ is satisfiable and $N \cup S$ is unsatisfiable, 
then there is a derivation of $\bot$ from $(N, S)$ [18]. 


Where a resolution derivation $(N, S) \Rightarrow^*_{\mathrm{RES}} (N, S')$ shows how new clauses 
can be derived from $(N, S)$, a deduction presents the minimal derivation of a 
single clause, e.g., the empty clause $\bot$ in case of a refutation. For deductions we 
require every clause to be used exactly once, so deductions always have a tree 
form. This is a purely technical restriction, see Corollary 5, that facilitates our 
deduction transformation technique, which then need not take care of variable 
renamings except for input clauses. 


Definition 2 (Deduction). A deduction $\pi_N = [C_1, \ldots, C_n]$ of a clause $C_n$ 
from some clause set $N$ is a finite sequence of clauses such that for each $C_i$ the 
following holds: 


1.1 $C_i$ is a renamed, variable-fresh version of a clause in $N$, or 

1.2 there is a clause $C_j \in \pi_N$, $j < i$ s.t. $C_i$ is the result of a Factoring inference 
from $C_j$, or 

1.3 there are clauses $C_j, C_k \in \pi_N$, $j < k < i$ s.t. $C_i$ is the result of a Resolution 
inference from $C_j$ and $C_k$, 


and for each $C_i \in \pi_N$, $i < n$: 


2.1 there exists exactly one factor $C_j$ of $C_i$ with $j > i$, or 
2.2 there exists exactly one $C_j$ and $C_k$ such that $C_k$ is a resolvent of $C_i$ and $C_j$ 
and $i, j < k$. 


We omit the subscript $N$ in $\pi_N$ if the context is clear. 


A deduction $\pi'$ of some clause $C \in \pi$, where $\pi$, $\pi'$ are deductions from 
$N$, is a subdeduction of $\pi$ if $\pi' \subseteq \pi$, where for the latter subset relation we 
identify sequences with multisets. A deduction $\pi_N = [C_1, \ldots, C_{n-1}, \bot]$ is called 
a refutation. 

Note that variable renamings are only applied to clauses from $N$ such that 
all clauses from $N$ that are introduced in the deduction are variable disjoint. 
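
The index conditions of Definition 2 can be checked mechanically once a deduction is abstracted to the positions of its clauses. The following is a minimal sketch under that abstraction; the function name `is_tree_deduction` and the `parents` encoding are our own assumptions, and the sketch does not verify that each inference is a correct Resolution or Factoring step.

```python
from collections import Counter


def is_tree_deduction(n_steps, parents):
    """Check the index conditions of a deduction on positions 1..n_steps.

    parents maps a position to the tuple of its parent positions:
    () for an input clause, (j,) for a factor, (j, k) for a resolvent.
    Positions missing from the mapping are treated as input clauses.
    """
    uses = Counter()
    for i in range(1, n_steps + 1):
        ps = parents.get(i, ())
        if len(ps) > 2 or len(set(ps)) != len(ps):
            return False
        if any(not (1 <= p < i) for p in ps):  # parents must occur earlier
            return False
        uses.update(ps)
    # every clause except the last one is used exactly once (tree form)
    return all(uses[i] == 1 for i in range(1, n_steps))


# Hypothetical deduction: 3 = resolvent of 1 and 2, 5 = resolvent of 3 and 4.
print(is_tree_deduction(5, {3: (1, 2), 5: (3, 4)}))
```

A sequence in which some clause feeds two different inferences is rejected, which is exactly the situation that has to be unfolded by copying with fresh variables.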


Definition 3 (SOS Deduction). A deduction $\pi_{N \cup S} = [C_1, \ldots, C_n]$ is called 
an SOS deduction if the derivation $(N, S_0) \Rightarrow^*_{\mathrm{RES}} (N, S_m)$ is an SOS derivation 
where $C'_1, \ldots, C'_m$ is the subsequence from $[C_1, \ldots, C_n]$ with input clauses removed, 
$S_0 = S$, and $S_{i+1} = S_i \cup \{C'_{i+1}\}$. 


Definition 4 (Overall Substitution of a Deduction). Given a deduction $\pi$ 
of a clause $C_n$ the overall substitution $\tau_{\pi,i}$ of $C_i \in \pi$ is recursively defined by 


1 if $C_i$ is a factor of $C_j$ with $j < i$ and mgu $\sigma$, then $\tau_{\pi,i} = \tau_{\pi,j} \circ \sigma$, 
2 if $C_i$ is a resolvent of $C_j$ and $C_k$ with $j < k < i$ and mgu $\sigma$, then $\tau_{\pi,i} = 
(\tau_{\pi,j} \circ \tau_{\pi,k}) \circ \sigma$, 
3 if $C_i$ is an initial clause, then $\tau_{\pi,i} = \emptyset$, 


and the overall substitution of the deduction is $\tau_\pi = \tau_{\pi,n}$. We omit the subscript 
$\pi$ if the context is clear. 


Overall substitutions are well-defined, because clauses introduced from $N$ 
into the deduction are variable disjoint and each clause is used exactly once in 
the deduction. A grounding of an overall substitution $\tau$ of some deduction $\pi$ is 
a substitution $\tau\delta$ such that $\mathrm{codom}(\tau\delta)$ only contains ground terms and $\mathrm{dom}(\delta)$ 
is exactly the variables from $\mathrm{codom}(\tau)$. 


Corollary 5 (Deduction Refutations versus Resolution Refutations). 
There exists a resolution refutation $(N, S) \Rightarrow^*_{\mathrm{RES}} (N, S' \cup \{\bot\})$ if and only if 
there exists a deduction refutation $\pi_{(N \cup S)} = [C_1, \ldots, C_{n-1}, \bot]$ where $C_i \in (N \cup S')$ 
for all $i$, modulo variable renaming. 


We prove the generalized completeness result of SOS by transforming non- 
SOS refutations into SOS refutations. For illustration of our proof transformation 
technique, consider the below unsatisfiable set of clauses N. Literals are labeled 
in N by a singleton set of a unique natural number [12]. We will refer to the 
literal labels during proof transformation in order to identify resolution and 
factorization steps. The labels are inherited in a resolution inference and united 
for the factorized literal in a factoring inference. See the factoring inference on 
clause (3), Figure 1. 


Figure 1 shows a resolution refutation 
$\pi = [(5), (6), (7), (1), (2), (3), (4), (8), (9), (10), (11), (12)]$ 
from $N$. This resolution refutation is also an SOS refutation with SOS $S = 
\{(2), (5)\}$ and remaining clause set $N \setminus S$. It is not an SOS refutation with SOS 
$S = \{(5)\}$ and the remaining clause set $N \setminus S$ because the resolution step between 
clauses (1) and (2) is not an SOS step. The shaded part of the tree belongs to 
an SOS deduction with $S = \{(5)\}$. 

The transformation identifies a clause closest to the leaves of the tree, obtained 
by resolution, that has one parent that can be derived by the SOS strategy, but 
the other parent is not in the SOS nor an input clause. For our example with 
starting SOS $S = \{(5)\}$ this is clause (8). The parent (7) can be derived via SOS 
from $S$ but the other parent (4) is not part of an SOS derivation. The overall 
grounding substitution of $\pi$ is $\tau = \{x_1 \mapsto b, x_2 \mapsto a, x_3 \mapsto b, x_4 \mapsto f(a), x_5 \mapsto 
a, x_6 \mapsto a\}$. Now the idea of a single transformation step is to perform the 
resolution step on the labelled literal $\{1,4\}\neg Q(b, f(a))$ and the respective literal 
$\{6\}Q(x_1, f(x_6))$ of the SOS derivable clause (7) already on the respective literals 
from the input clauses yielding (8), here clauses (1) and (2). To this end the 
derivation $[(5), (6), (7)]$ is copied with fresh variables, see Figure 2, yielding the 
clauses (7) and (7') used in the refutation $\pi'$ below, see also Figure 3. 


Fig. 1. Refutation $\pi$ of $N$. The tree is rendered here as a list; clauses (5), (6) and (7) form the shaded part. 

(1): $\{1\}\neg Q(x_3, f(a)) \lor \{2\}P(f(a))$, input clause 
(2): $\{3\}\neg P(x_4) \lor \{4\}\neg Q(b, x_4)$, input clause 
(3): $\{1\}\neg Q(x_3, f(a)) \lor \{4\}\neg Q(b, f(a))$, resolvent of (1) and (2) with $\{x_4 \mapsto f(a)\}$ 
(4): $\{1,4\}\neg Q(b, f(a))$, factor of (3) with $\{x_3 \mapsto b\}$ 
(5): $\{5\}\neg Q(b, a) \lor \{6\}Q(x_1, f(x_6))$, input clause 
(6): $\{7\}Q(b, x_2) \lor \{8\}R(x_2) \lor \{9\}T(c, x_1)$, input clause 
(7): $\{6\}Q(x_1, f(x_6)) \lor \{8\}R(a) \lor \{9\}T(c, x_1)$, resolvent of (5) and (6) with $\{x_2 \mapsto a\}$ 
(8): $\{8\}R(a) \lor \{9\}T(c, b)$, resolvent of (7) and (4) with $\{x_1 \mapsto b, x_6 \mapsto a\}$ 
(9): $\{10\}\neg R(x_5)$, input clause 
(10): $\{9\}T(c, b)$, resolvent of (8) and (9) with $\{x_5 \mapsto a\}$ 
(11): $\{11\}\neg T(c, b)$, input clause 
(12): $\bot$, resolvent of (10) and (11) 


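Fig. 2. The copied subdeductions deriving (7), rendered here as a list. 

(5): $\{5\}\neg Q(b, a) \lor \{6\}Q(x_7, f(x_9))$, renamed copy of input clause (5) 
(6): $\{7\}Q(b, x_8) \lor \{8\}R(x_8) \lor \{9\}T(c, x_7)$, renamed copy of input clause (6) 
(7): $\{6\}Q(x_7, f(x_9)) \lor \{8\}R(a) \lor \{9\}T(c, x_7)$, resolvent of the two copies with $\{x_8 \mapsto a\}$ 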


The two freshly renamed copies (7) and (7') are resolved with the respective 
input clauses (1) and (2). Finally, the rest of the deduction yielding clause (8) 
is simulated with the resolved input clauses, see Figure 3. Now (8''') is exactly 
clause (8) from the original deduction $\pi$, but (8''') is derived by an SOS deduction. 
The deduction can then be continued the same way it was done in $\pi$ and in this 
case will already yield an SOS refutation. 

$\pi' = [(5), (6), (7), (5'), (6'), (7'), (1), (1'), (2), (2'), (8'), (8''), (8'''), 
(9), (10), (11), (12)]$ 

The example motivates our use of literal labels. Firstly, they tell us which literals 
from input clauses need to be resolved: here the literals $\{1\}\neg Q(x_3, f(a))$ and 
$\{4\}\neg Q(b, x_4)$ that are factorized in $\pi$ to $\{1,4\}\neg Q(b, f(a))$. Secondly, they guide 
additional factoring steps in $\pi'$ during the simulation of the non-SOS part from 
$\pi$: here the factoring between the two literals labelled $\{8\}$ in clause (8') and 
the two literals with label $\{9\}$ in clause (8''). The transformation always works 
because the overall grounding substitution of the initial refutation $\pi$ is preserved 
by the transformation. It just needs to be extended to the extra variables added 
by freshly renamed copies of clauses. 


Fig. 3. The new SOS deduction yielding a copy of clause (8), rendered here as a list. 

(1): $\{1\}\neg Q(x_3, f(a)) \lor \{2\}P(f(a))$, input clause 
(2): $\{3\}\neg P(x_4) \lor \{4\}\neg Q(b, x_4)$, input clause 
(7): $\{6\}Q(x_7, f(x_9)) \lor \{8\}R(a) \lor \{9\}T(c, x_7)$, renamed copy 
(7'): $\{6\}Q(x_{10}, f(x_{11})) \lor \{8\}R(a) \lor \{9\}T(c, x_{10})$, renamed copy 
(1'): $\{8\}R(a) \lor \{9\}T(c, x_3) \lor \{2\}P(f(a))$, resolvent of (1) and (7) with $\{x_9 \mapsto a, x_7 \mapsto x_3\}$ 
(2'): $\{8\}R(a) \lor \{9\}T(c, b) \lor \{3\}\neg P(f(x_{11}))$, resolvent of (2) and (7') with $\{x_4 \mapsto f(x_{11}), x_{10} \mapsto b\}$ 
(8'): $\{8\}R(a) \lor \{9\}T(c, x_3) \lor \{8\}R(a) \lor \{9\}T(c, b)$, resolvent of (1') and (2') with $\{x_{11} \mapsto a\}$ 
(8''): $\{8\}R(a) \lor \{9\}T(c, x_3) \lor \{9\}T(c, b)$, factor of (8') on the literals labelled $\{8\}$ 
(8'''): $\{8\}R(a) \lor \{9\}T(c, b)$, factor of (8'') with $\{x_3 \mapsto b\}$ 

The above example shows the importance of keeping track of the occurrences 
of literals in a deduction. A labeled literal is a pair $M L$ where $M$ is a finite 
non-empty set of natural numbers called the label and $L$ is a literal. We identify 
literals with labeled literals and refer explicitly to the label of a labeled literal by 
the function lb. The function lb is extended to clauses via union of the respective 
literal labels. We extend the notion of a clause to that of a labeled clause built on 
labeled literals in the straightforward way. We call a deduction $\pi_N$ label-disjoint 
if the clauses from $N$ in the deduction have unique singleton labels. Labels are 
inherited in a deduction as follows: in case of a resolution inference, the labels of 
the parent clauses are inherited and in case of the factoring inference, the label 
of the remaining literal is the union of labels of the factorized literals. 
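
The label bookkeeping can be illustrated on ground clauses, where no unifier has to be computed. The following is a minimal sketch under our own encoding, not the paper's implementation: a labeled literal is a pair of a frozenset and a string, negation is written with a leading `~`, clauses are lists (multisets), and the names `lb`, `comp`, `resolve` and `factor` mirror the text but are hypothetical helpers. The example uses ground instances of clauses (1) and (2) of the running example as we read them from Figure 1.

```python
from typing import FrozenSet, List, Tuple

LabeledLiteral = Tuple[FrozenSet[int], str]  # (label M, ground literal L)
Clause = List[LabeledLiteral]                # clauses as multisets (lists)


def lb(clause: Clause) -> FrozenSet[int]:
    """Label of a clause: union of the labels of its literals."""
    out: FrozenSet[int] = frozenset()
    for label, _ in clause:
        out |= label
    return out


def comp(lit: str) -> str:
    """Complement of a ground literal; '~' marks negation."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def resolve(c: Clause, i: int, d: Clause, j: int) -> Clause:
    """Ground resolution on c[i] and d[j]; remaining literals keep labels."""
    assert c[i][1] == comp(d[j][1])
    return c[:i] + c[i + 1:] + d[:j] + d[j + 1:]


def factor(c: Clause, i: int, j: int) -> Clause:
    """Ground factoring of c[i] and c[j]; the survivor gets both labels."""
    assert i < j and c[i][1] == c[j][1]
    merged = (c[i][0] | c[j][0], c[i][1])
    return c[:i] + [merged] + c[i + 1:j] + c[j + 1:]


# Ground instances of clauses (1) and (2) under x3 -> b, x4 -> f(a).
c1 = [(frozenset({1}), "~Q(b,f(a))"), (frozenset({2}), "P(f(a))")]
c2 = [(frozenset({3}), "~P(f(a))"), (frozenset({4}), "~Q(b,f(a))")]
c3 = resolve(c1, 1, c2, 0)  # labels {1} and {4} are inherited
c4 = factor(c3, 0, 1)       # the remaining literal carries label {1, 4}
print(c4, lb(c4))
```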

In general, we need to identify the parts of a deduction that are already 
contained in an SOS deduction; this is called the partial SOS of a deduction 
(Definition 6). Then this information can be used to perform the above transformation 
on any deduction $\pi$. 


Definition 6 (PSOS of a Deduction). Let $\pi$ be a deduction from $N \uplus S$, 
then the partial SOS (PSOS) $O^*$ of $(\pi, N, S)$ is defined as $O^* = \bigcup_{i=0}^{m} O^i$, where 
$O^0 = S$, $O^{i+1} = O^i \cup \{C_j\}$ provided $C_j \in \pi$, $C_j \notin O^i$ and $C_j$ is either the factor 
of some clause in $O^i$ or the resolvent of two clauses in $\pi$ where at least one parent 
is from $O^i$, and where $O^m$ is such that there is no longer such a $C_j$ in $\pi$. 


The partial SOS is well-defined because the resulting $O^*$ is independent of 
the sequence $O^i$ used. For example, for the deduction $\pi$ from $N$ presented in 
Figure 1 the set $O^* = \{(5), (6), (7)\}$ is the PSOS of $(\pi, N, \{(5)\})$. Next we present 
a criterion when the PSOS of a deduction actually signals an SOS deduction. 


Lemma 7 (SOS Deduction). Let $O^*$ be the PSOS of $(\pi, N, S)$. Then $\pi$ is 
an SOS deduction if $O^* \setminus S = \pi \setminus (N \cup S)$,⁴ i.e., all inferred clauses in $\pi$ are 
contained in $O^*$. 


Proof. Let $\pi_{N \cup S} = [C_1, \ldots, C_n]$ and $[C'_1, \ldots, C'_m]$ be the subsequence of $\pi_{N \cup S}$ 
with input clauses removed. Let $O^*$ be the PSOS of $(\pi, N, S)$. Then $[C'_1, \ldots, C'_m] = 
O^* \setminus S = \pi \setminus (N \cup S)$ by assumption. We show that $(N, S_0) \Rightarrow^*_{\mathrm{RES}} (N, S_m)$ is 
an SOS derivation, following Definition 3, by induction on $m$. If $m = 0$ then $\pi$ 
only consists of input clauses and there is nothing to show. For the case $m = 1$, 
the clause $C'_1$ is the result of a factoring inference from $S$ or the result of a 
resolution inference from $N \cup S$ such that at least one parent is in $S$, as 
otherwise $C'_1 \notin (O^* \setminus S)$. So $(N, S_0) \Rightarrow_{\mathrm{RES}} (N, S_0 \cup \{C'_1\})$ is an SOS derivation. 
For the induction case, assume the property holds for $i$. If $C'_{i+1}$ is the result 
of a factoring inference, then its parent $C'$ is contained in $S_i$, because 
otherwise $C' \in N$, since $\pi$ is a deduction, and therefore $C'_{i+1} \notin (O^* \setminus S)$, a 
contradiction. If $C'_{i+1}$ is the result of a resolution inference, then again all its 
parents are contained in $N \cup S_i$ because $\pi$ is a deduction. If both parents are 
from $N$, then $C'_{i+1} \notin (O^* \setminus S)$, a contradiction. So, by the induction hypothesis, 
$(N, S_0) \Rightarrow^*_{\mathrm{RES}} (N, S_i) \Rightarrow_{\mathrm{RES}} (N, S_{i+1})$ is an SOS derivation. 


⁴ Here we refer to the removal of all input clauses from $O^*$ and $\pi$, respectively. 


The rest of this section is devoted to describing the transformation in detail. 
In the next section, we then prove the new completeness result for SOS. 

Let $\pi$ be a label-disjoint deduction from $N \cup S$ and let $C_k \in \pi$ be a clause of 
minimal index such that $C_k$ is the result of a resolution inference from clauses 
$C_j \in O^*$ and $C_i \notin (N \cup O^*)$. Let $\tau$ be an overall ground substitution for $\pi$. 
We transform $\pi$ into $\pi'$ by changing the deduction of $C_k$ such that the overall 
deduction gets “closer” to an SOS derivation and preserves $\tau$. Let 

$$C_j = C'_j \lor L, \qquad C_i = C'_i \lor K, \qquad C_k = (C'_j \lor C'_i)\sigma \tag{1}$$

where $\sigma = \mathrm{mgu}(K, \mathrm{comp}(L))$. Without loss of generality we assume that 

$$\pi = [C_1, \ldots, C_i, C_{i+1}, \ldots, C_j, C_k, C_{k+1}, \ldots, C_n] \tag{2}$$

where $[C_1, \ldots, C_i]$ and $[C_{i+1}, \ldots, C_j]$ are subdeductions of $\pi$, and the prefixes 
of these sequences are exactly the introduced renamed copies of input clauses from 
$N$ that are used to derive $C_i$ and $C_j$, respectively. The transformed derivation 
will be 

$$\pi' = [C^1_{i+1}, \ldots, C^1_j, \ldots, C^m_{i+1}, \ldots, C^m_j, D_1, \ldots, D_l, C'_{k+1}, \ldots, C'_n] \tag{3}$$

where 

(a) the subsequences $[C^p_{i+1}, \ldots, C^p_j]$ are freshly variable-renamed copies of the 
sequence $[C_{i+1}, \ldots, C_j]$ where $m = |\mathrm{lb}(K)|$. For the copies $[C^p_{i+1}, \ldots, C^p_j]$ we 
keep the labels of literals of the original sequence $[C_{i+1}, \ldots, C_j]$ for reference 
in the transformation. The clauses $C^p_j$ are decomposed into $C'^p_j \lor L'$, in 
the same way that the clause $C_j$ is decomposed into $C'_j \lor L$. Thus, for 
each clause from $N$ in the sequence $[C_1, \ldots, C_i]$ containing a literal $K'$ with 
$\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$ we add a deduction deriving a renamed copy of $C_j$; let $\delta^p$ 
be the renaming substitution from the old to the freshly renamed sequence, 
then we extend $\tau$ to $\tau'$ as follows: $\tau'_0 = \tau$, $\tau'_{p+1} = \tau'_p \circ \{x\delta^{p+1} \mapsto t \mid x \in 
\mathrm{dom}(\delta^{p+1}), t = x\tau\}$ for $1 \le p \le m$ yielding the overall new grounding 
substitution $\tau' = \tau'_m$ for $\pi'$; 

(b) the clauses $D_1, \ldots, D_l$ are generated by simulating the deduction $[C_1, \ldots, C_i]$ 
eventually producing $C_k$, up to possible variable renamings: Let $C_p$ be the 
current clause out of this deduction and let $D_1, \ldots, D_q$ be the clauses generated 
so far until $C_{p-1}$; 


(i) if $C_p$ is an input clause not containing a literal $K'$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$, 
then $D_{q+1} = C_p$ and we associate $D_{q+1}$ with $C_p$; 

(ii) if $C_p$ is an input clause containing a literal $K'$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$, then 
$D_{q+1} = C_p$ and $D_{q+2}$ is the resolvent between $D_{q+1}$ and a so far unused 
clause $C^p_j$ on the literals $K' \in D_{q+1}$ and $L' \in C^p_j$ where $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$ 
and $\mathrm{lb}(L') = \mathrm{lb}(L)$ and we associate $D_{q+2}$ with $C_p$; 

(iii) if $C_p$ is the resolvent between two clauses $C_{p'}$, $C_{p''}$, then we perform the 
respective resolution step between the associated clauses and respective 
associated literals from $D_{q'}$, $D_{q''}$ yielding $D_{q+1}$ and associate $D_{q+1}$ with 
$C_p$; 

(iv) if $C_p$ is the factor on some literal $K'$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$, then we perform 
the respective factoring steps $D_{q+1}, \ldots, D_{q+s}$ for respective literals with 
labels from $C'_j$, where $s = |C'_j|$ and we associate $D_{q+s}$ with $C_p$; 

(v) if $C_p$ is the factor on some literal $K'$ with $\mathrm{lb}(K') \not\subseteq \mathrm{lb}(K)$, then we perform 
the respective factoring step on the respective literals with identical labels 
from clause $D_q$ yielding $D_{q+1}$ and we associate $D_{q+1}$ with $C_p$; 


(c) the clauses $C'_{k+1}, \ldots, C'_n$ are obtained by simulating the generation of clauses 
$C_{k+1}, \ldots, C_n$ where $C_k$ is substituted with $D_l$. 


Note that by assumption, the generation of clauses $C_{k+1}, \ldots, C_n$ does not 
depend on clauses $C_1, \ldots, C_i, C_{i+1}, \ldots, C_j$ but only on $C_k$ and the input clauses. 
We will prove that $C_k\tau = C_k\tau' = D_l\tau'$ which is then sufficient to prove $C_n\tau = 
C_n\tau' = C'_n\tau'$ and for the above to be well-defined. In general, the clause $D_l$ is 
not identical to $C_k$, because we introduce fresh variables in $\pi'$ and do not make 
any specific assumptions on the unifiers used to derive $D_l$. 


Mapping the transformation to our running example, Figure 1: $C_j = (7)$, 
$C_i = (4)$, and $C_k = (8)$. We need two copies of (7) because $K = \{1,4\}\neg Q(b, f(a))$, 
so $m = |\{1,4\}| = 2$ and $L = \{6\}Q(x_1, f(x_6))$. 


336 F. Haifani et al. 
3 A Generalized Completeness Proof for SOS 


In this section, we prove that repeated applications of the transformation intro- 
duced in the previous section can actually transform an arbitrary deduction into 
an SOS deduction, given that at least one clause from the SOS occurs in the 
original deduction. Firstly, we show that associated clauses of the transformed 
deduction preserve main properties of the original deduction. The extended sub- 
stitution is identical to the original substitution on old clauses and the changed 
part of the deduction ends in exactly the same clause. 


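Lemma 8 (Properties of Associated Clauses). Let $C_j$, $C_i$, $C_k$, $L$, $K$, $\pi$, 
$\pi'$, $\tau$, $\tau'$ be as defined in (1), (2), and (3), page 334. For each clause $C$ out of 
$[C_1, \ldots, C_i]$ and clause $D$ associated with $C$: 


1. $C\tau = C\tau'$, 

2. $K'\tau' = L'\tau'$ if $\mathrm{lb}(K') = \mathrm{lb}(L')$ for any $K'$, $L'$ occurring in either $\pi$ or $\pi'$, 

3. $\mathrm{lb}(C) \setminus \mathrm{lb}(K) = \mathrm{lb}(D) \setminus \mathrm{lb}(C'^p_j)$ and $\mathrm{lb}(C'^p_j) \subseteq \mathrm{lb}(D)$ if there is $K' \in C$ with 
$\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$, 

4. $C\tau \setminus \{K'\tau \in C\tau \mid \mathrm{lb}(K') \subseteq \mathrm{lb}(K)\} = D\tau' \setminus \{L'\tau' \in D\tau' \mid \mathrm{lb}(L') \subseteq \mathrm{lb}(C'^p_j)\}$ 
and $C'^p_j\tau' \subseteq D\tau'$ if there is $K' \in C$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$, 

5. $C_k\tau = D_l\tau'$. 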


Proof. 1. By definition of $\tau'$ the additional variables in $\tau'$ do not occur in $C$ 
while $\tau'$ is identical to $\tau$ on the variables of $C$, hence $C\tau = C\tau'$. 


2. By induction on the generation of $\pi'$. For the base case, every literal occurring 
in $N \cup S$ has a unique label and any renamed clause $C^p_m$, for some $C_m \in (N \cup S)$, 
has the labels kept. So, for any two literals $K'$ and $L'$ in any non-inferred clauses 
in $\pi$ and $\pi'$, $K'\tau' = L'\tau'$ when the labels are equal. For the induction step, for 
inferred clauses, $\mathrm{lb}(K') = \mathrm{lb}(L')$ happens when the label of $K'$ is inherited from 
$L'$ through an inference. The inference uses an mgu which is compatible with $\tau'$ 
due to $\tau'$ being an overall ground substitution, so $K'\tau' = L'\tau'$. 


3. We prove this property by induction on the length of the derivation $[C_1, \ldots, C_i]$. 
Let $C = C_p$, $1 \le p \le i$, and let $D_1, \ldots, D_q$ be the clauses generated until $C_{p-1}$ 
for which, by the induction hypothesis, the property already holds. 


(i) If $C$ is an input clause not containing a literal $K'$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$, we 
have $C = C_p = D_{q+1} = D$ and $\{K' \in C\tau \mid \mathrm{lb}(K') \subseteq \mathrm{lb}(K)\} = \{L' \in D\tau' \mid 
\mathrm{lb}(L') \subseteq \mathrm{lb}(C'^p_j)\} = \emptyset$. 

(ii) If $C$ is an input clause containing a literal $K'$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$ then 
$D = D_{q+2}$ results from a resolution inference between $C = C_p$ and an unused 
$C^p_j$ on the literals $K'$ and $L' \in C^p_j$ with $\mathrm{lb}(L') = \mathrm{lb}(L)$. Let $C = C' \lor K'$. 
Then $D\tau' = (C' \lor C'^p_j)\tau'$ and hence $\mathrm{lb}(C) \setminus \mathrm{lb}(K) = \mathrm{lb}(D) \setminus \mathrm{lb}(C'^p_j)$ because 
$\mathrm{lb}(C) \cap \mathrm{lb}(C^p_j) = \emptyset$ as $\pi$ is a label-disjoint deduction and $\mathrm{lb}(C_j) = \mathrm{lb}(C^p_j)$ by 
construction. 

(iii) If $C$ is a resolvent of $C_{p'} = C'_{p'} \lor L_{p'}$ and $C_{p''} = C'_{p''} \lor L_{p''}$ on literals $L_{p'}$, $L_{p''}$, 
then $C\tau = C'_{p'}\tau \lor C'_{p''}\tau$, and $D_{q+1}$ is a resolvent of some $D_{q'} = D'_{q'} \lor L_{q'}$ 
and $D_{q''} = D'_{q''} \lor L_{q''}$ associated with $C_{p'}$ and $C_{p''}$, respectively. We have 
$\mathrm{lb}(L_{p'}) = \mathrm{lb}(L_{q'})$ and $\mathrm{lb}(L_{p''}) = \mathrm{lb}(L_{q''})$ and none of these literals has a 
label from $\mathrm{lb}(K)$ or $\mathrm{lb}(C'^p_j)$. Hence, the conjecture holds by the induction 
hypothesis. 

(iv) If $C$ results from a factoring on $K'$ from $C_{p-1}$, we get $D_{q+s}$ by a sequence 
of $s$ factoring inferences from $D_{q+1}$ associated with $C_{p-1}$. Any factorings 
on $C_{p-1}$ and $D_{q+1}$ do not change literal labels because we factorize literals 
of identical label. So, this property holds by the induction hypothesis. This 
holds regardless of whether $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$. 


4. From Lemma 8.3 we know that $\mathrm{lb}(C) \setminus \mathrm{lb}(K) = \mathrm{lb}(D) \setminus \mathrm{lb}(C'^p_j)$ and $\mathrm{lb}(C'^p_j) \subseteq 
\mathrm{lb}(D)$ if there is $K' \in C$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$. Since the labels coincide, using 
Lemma 8.2, we have $C\tau' \setminus \{K' \in C\tau' \mid \mathrm{lb}(K') \subseteq \mathrm{lb}(K)\} = D\tau' \setminus \{L' \in D\tau' \mid 
\mathrm{lb}(L') \subseteq \mathrm{lb}(C'^p_j)\}$ and $C'^p_j\tau' \subseteq D\tau'$ if there is $K' \in C$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$. This 
hypothesis holds by applying Lemma 8.1 on literals and clauses from $\pi$ in the 
equation. 


5. The clause $C_k$ is the result of a resolution inference between $C_i$ and $C_j$ upon 
$K$ and $L$: $C_k\tau = C'_i\tau \cup C'_j\tau$. By translation and because $\{K' \in C_i \mid \mathrm{lb}(K') \subseteq 
\mathrm{lb}(K)\} = \{K\}$, the clause $C_i$ is associated with $D_l \in \pi'$ and $C_i\tau \setminus \{K\tau\} = 
D_l\tau' \setminus \{L' \in D_l\tau' \mid \mathrm{lb}(L') \subseteq \mathrm{lb}(C'^p_j)\}$. Since $C'^p_j\tau' = C'_j\tau = C_j\tau \setminus \{L\tau\}$, we have 
$\{L'' \in D_l\tau' \mid \mathrm{lb}(L'') \subseteq \mathrm{lb}(L')$ for some $L' \in C'^p_j\} = D_l\tau' \cap C_j\tau \setminus \{L\tau\} = C_j\tau \setminus \{L\tau\}$. 
So $C_i\tau \setminus \{K\tau\} = D_l\tau' \setminus (D_l\tau' \cap C_j\tau \setminus \{L\tau\}) = D_l\tau' \setminus (C_j\tau \setminus \{L\tau\})$. We can add 
$C_j\tau \setminus \{L\tau\}$ to both sides and get $C_k\tau = C_i\tau \cup C_j\tau \setminus \{K\tau, L\tau\} \supseteq D_l\tau'$. In 
addition, since $\mathrm{lb}(K) \subseteq \mathrm{lb}(K)$, this means $C'_j\tau = C'^p_j\tau' \subseteq D_l\tau'$. Therefore 
$C_k\tau = C_i\tau \cup C_j\tau \setminus \{K\tau, L\tau\} = D_l\tau'$. 


Next we need a well-founded measure that decreases with every transformation 
step and in case of reaching its minimum signals an SOS deduction. Given a 
clause set $N$ and an initial SOS $S$, the SOS measure of a deduction $\pi$ is $\mu(\pi)$ where 
$\mu(\pi) = \sum_{C_i \in \pi} \mu(C_i, \pi)$ and $\mu(C_i, \pi) = 0$ if $C_i \in N \cup O^*$, otherwise $\mu(C_i, \pi) = 1$. 
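
Both the partial SOS and the measure only depend on the shape of a deduction, so they can be computed on clause positions. The following is a minimal sketch that reads Definition 6 literally (a clause enters $O^*$ as soon as one of its parents is in $O^*$); the `parents` encoding and the function names `psos` and `mu` are our own assumptions, and the small deduction is hypothetical rather than taken from the figures.

```python
def psos(parents, sos):
    """Partial SOS O* as a set of positions, computed as a fixpoint.

    parents maps every position of the deduction to its parent positions:
    () for an input clause, (j,) for a factor, (j, k) for a resolvent.
    """
    o = set(sos)
    changed = True
    while changed:
        changed = False
        for c, ps in parents.items():
            if c not in o and any(p in o for p in ps):
                o.add(c)
                changed = True
    return o


def mu(parents, inputs_n, sos):
    """SOS measure: number of clauses that are neither in N nor in O*."""
    o_star = psos(parents, sos)
    return sum(1 for c in parents if c not in inputs_n and c not in o_star)


# Hypothetical deduction: 1, 2, 3 are input clauses,
# 4 = resolvent of 2 and 3, 5 = resolvent of 1 and 4.
parents = {1: (), 2: (), 3: (), 4: (2, 3), 5: (1, 4)}
print(psos(parents, {1}), mu(parents, {2, 3}, {1}))
print(mu(parents, {3}, {1, 2}))
```

With SOS {1} the inference producing clause 4 uses no SOS parent, so the measure is positive; with SOS {1, 2} every inferred clause lies in the partial SOS and the measure is zero.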


Lemma 9 (Properties of $\mu$). Given a clause set $N$, an initial SOS $S$, and a 
deduction $\pi$ that contains at least one resolution step, 


1. $\mu(\pi) \ge 0$, and 
2. if $\mu(\pi) = 0$ then $\pi$ is an SOS deduction. 


Proof. 1. Obvious. 
2. Towards contradiction, suppose $\pi = [C_1, \ldots, C_n]$ is not an SOS deduction. 
This means $O^* \setminus S \subset \pi \setminus (N \cup S)$ by Lemma 7. Consider a clause $C_i \in (\pi \setminus (N \cup 
S)) \setminus (O^* \setminus S)$ of minimal index. Then $C_i$ must be the result of an inference on 
some $C_j$ and $C_k$ such that both are not in $O^*$. This means $C_i \notin (N \cup O^*)$. For 
this clause, $\mu$ assigns a nonzero value: $\mu(C_i, \pi) > 0$. Therefore, $\mu(\pi) \neq 0$. 


Next we combine the properties of associated clauses on one transformation 
step with the properties of the measure resulting in an overall deduction trans- 
formation that can be recursively applied and deduces the same clause modulo 
some grounding. 
Lemma 10 (Properties of the Transformation). Given a deduction $\pi$ of a
clause $C_n$ from $N \cup S$ that contains at least one resolution step such that $\pi \cap S \neq \emptyset$,
an overall ground substitution $\tau$ of $\pi$ and the transformed deduction $\pi'$ of a clause
$C'_n$ as defined in (1), (2), and (3) with overall ground substitution $\tau'$, we have:


1. $\pi'$ is a deduction from $N \cup S$,

2. $C_n\tau = C'_n\tau'$, and

3. $\mu(\pi') < \mu(\pi)$.

Proof. 1. We show that $\pi'$ is a deduction following Definition 2. These properties
will be carried over from $\pi$. Observe that, if $\pi_1$ is a deduction of $C_p$ from $N \cup S$
and $\pi_2$ is a deduction from $N \cup S \cup \{C_p\}$ using $C_p$ only once, their concatenation
$\pi_1 \circ \pi_2$ is a deduction from $N \cup S$. Firstly, the subsequences $[C^p_{i+1}, \ldots, C^p_j]$ are
deductions of $C^p_j$ from $N \cup S$ since they are only the renamed copies of the
subdeduction $[C_{i+1}, \ldots, C_j]$ of $\pi$. Secondly, the subsequence $[C_k, \ldots, C_n]$ is a
deduction of $C_n$ from $N \cup S \cup \{C_k\}$ since the clauses after $C_k$ do not use any
clauses before $C_k$ by the way $\pi$ is represented as a sequence. Now, by showing
that $[C^1_j, \ldots, C^m_j, D_1, \ldots, D_l, C_k]$ is a deduction of $C_k$ from $N \cup S \cup \{C^1_j, \ldots, C^m_j\}$,
the sequence $[D_1, \ldots, D_l]$ would then connect the initial copied sequences and
the trailing subsequence. Each $C^p_j$ is used for exactly one resolution inference
producing some Dy, the other required clauses are copied, and the later resolution 
and factoring steps in [D1,..., Dı] are sound while the deduction properties of 
[C;,...,Ci] are preserved in its associated clauses: for an inference where Cy 
(and Cpr) generates Cp, we have a unique inference between their associated 
clauses Dy, (Dq,) Dg+1 where Dy (and Dgr) generates Dy+1, possibly with 
additional factoring inferences in between. If Cp is an input clause not containing 
a literal K’ with lb(K’) C lb(K), then Dj41 = Cp € N. The clause D,+1 is used 
in 7’ as Cp is used in 7; if Cp is an input clause containing a literal K’ with 
lb(K') C lb(K), the resolution between Dj+1 and a so far unused clause C? is 
sound as K’ and comp(L’) are unifiable by 7’. Here, all C? will be eventually used 
as there are m = |lb(K)| literals in the clauses from N; if Cp is the resolvent 
between two clauses Cy,Cj then the respective resolution step between the 
associated clauses Dy, Dg” upon the respective associated literals K’ and L’ 
is sound because we can get K’'r'’ = comp(L’)r’ using Lemma 8; if Cp is the 
factor on some literal A’ with lb(AK‘) C Ib(K), then the respective factoring 
steps Dg+41,.-.,Dq+s are also sound: each pair of the s associated literals M and 
M’ from C? and ce are unifiable because Mr’ = M’'r'; if Cp is the factor of 
Cp-1 upon some literal K’ and L’ with {lb(K’), lb(L’)} Z lb(4), the respective 
factoring step on the associated clause Dg is also sound by Lemma 8. Therefore 
m is a deduction from NUS. 

2. By Lemma 8.5, $C_k\tau = D_l\tau'$. The derivation of clauses $C_k, C_{k+1}, \ldots, C_n$ only
depends on the input clauses by assumption. By an inductive argument we get
$C_{k+1}\tau = C'_{k+1}\tau'$ yielding $C_n\tau = C'_n\tau'$.

3. The clauses in $[C^p_{i+1}, \ldots, C^p_j]$ have the measure 0 as their original ones in
$[C_{i+1}, \ldots, C_j]$ because they are in $N \cup O^*$. The clauses in $[C_k, \ldots, C_n]$ also retain
their original measures. The clauses in $[D_1, \ldots, D_l]$ are s.t. $\sum_{k=1}^{l} \mu(D_k, \pi') <
\sum_{k=1}^{i} \mu(C_k, \pi)$. More specifically, any $C \in [C_1, \ldots, C_i]$ that is not in $N \cup O^*$ (with
measure $\mu(C, \pi) \geq 1$) and containing $K'$ with $\mathrm{lb}(K') \subseteq \mathrm{lb}(K)$ is associated
with $D_q \in O^* \setminus N$ having the measure $\mu(D_q, \pi') = 0$, while all other clauses in
$[D_1, \ldots, D_l]$ are either copied from $\pi$ with the same measure as before or new in
$\pi'$ but have the measure 0.

By induction on the length of the sequence $[C_1, \ldots, C_i]$ we prove the following
property: if $D$ is associated with a clause $C \in [C_1, \ldots, C_i]$ and $C$ contains some
literal in $\{K' \mid \mathrm{lb}(K') \subseteq \mathrm{lb}(K)\}$, then $D \in N \cup O^*$ and $\mu(D, \pi') = 0$. Let $C = C_p$.
Let $D_1, \ldots, D_q$ be the clauses generated until $C_{p-1}$ s.t. the property already
holds.


(i) If Cp is an input clause with no literals in {A’ | lb(’) C Ib(K)}, it is 
associated with D4 = Cp s.t. (Cp, T) = u(Dq, T") = 0; 

(ii) If Cp is an input clause containing {K | lb(A’) C Ib(K)}, it is resolved with 
some C? € O* resulting in Dy41 € O*. Here we have (Cp, 7) = u(Dq, 7’) = 
0; 

(iii) If Cp is the resolvent between two clauses Cy, Cj then we perform the 
respective resolution step between the associated clauses Dy, Dg yielding 
the clause D, associated with Cp. If either Cy or Cj, contains some literal 
from {K | Ib(K') C Ib(K)} then C, contains this literal as well and either 
Dy € O* or Dg € O* by the induction hypothesis. So, we get Dg E€ O* and 
u(Dg, T) = 0. Otherwise, (Dg, T’) = (Cp, T) = 1; 

(iv) If Cp is the factor of Cp—ı on some literal A’ with lb(A’) C lb(K), then 
we have the respective factoring steps Dj41,...,Dq4+s where Dg+1 is as- 
sociated with Cp_;. By the induction hypothesis, Dy,; € O*. Therefore 
Da+1;---, Doig E O* with P Daag, =O for 1<t< s; 

(v) If Cp is the factor of Cp—1 (associated with D,) on some literal K’ and L’ with 
{b(K'),Ib(L’)} Z Ib(K), the factoring happens to the associated clauses in 
n’ with similar measure. 


Finally, by the choice of $C_i$, $C_j$, and $C_k$, there must exist at least one $C_p$ with
some literal from $\{K' \mid \mathrm{lb}(K') \subseteq \mathrm{lb}(K)\}$ but associated with some $D$ such that
$D \in O^*$ from case (iii) or (iv) before. This also means $\mu(D, \pi') = 0$. The clause
$C_i$ has this property as it contains $K$. In addition, any $C_p$ has a nonzero measure
because $C_i \notin N \cup O^*$ and $C_p$ is used to prove $C_i$. Therefore, we have $\mu(C_p, \pi) >
\mu(D, \pi') = 0$. As these clauses are never copied to $\pi'$, $\mu(\pi') < \mu(\pi)$.


Eventually, by an inductive argument we prove our main result. 


Theorem 11 (Generalized SOS Completeness). There is an SOS resolu- 
tion refutation from (N,S) if and only if there is resolution refutation from NUS 
that contains at least one clause from S. 


Proof. “$\Rightarrow$”: Obvious: if there is no refutation from $N \cup S$ using a clause from $S$, then
there cannot be any SOS resolution refutation from $(N, S)$ either.


“$\Leftarrow$”: If there is a deduction refutation $\pi$ from $N \cup S$ that contains at least one
clause from $S$, then by an inductive argument on $\mu$ it can be transformed into
an SOS deduction refutation with SOS $S$, and the result follows by Corollary 5.
If $\mu(\pi) = 0$ then $\pi$ is already an SOS deduction by Lemma 9. Otherwise, we
transform the deduction $\pi$ into a deduction $\pi'$ according to (1), (2), and (3). A
refutation always contains at least one resolution step, so by Lemma 10, $\pi'$ is also
a refutation from $N \cup S$ and $\mu(\pi') < \mu(\pi)$. Eventually, $\pi'$ can be transformed
into a label-disjoint deduction by assigning fresh labels to all used clauses from
$N \cup S$.


As an example for the “$\Rightarrow$” direction consider the propositional logic clause
set $N = \{P, \neg P\}$ and SOS $S = \{Q\}$. Obviously, there is no refutation of $N \cup
S$ using $Q$ and there is no SOS refutation. Theorem 11 also guarantees that
the consecutive application of the proof transformation steps (1), (2), and (3),
page 334, results in an effective recursive procedure that transforms non-SOS
refutations into SOS refutations.


4 A new Notion of Relevance 


The idea of our notion of relevance is to separate clauses that are ultimately 
needed in a refutation proof called relevant, from clauses that are useful called 
semi-relevant, from clauses that are not needed called irrelevant. 


Definition 12 (Relevance). Given an unsatisfiable set of clauses $N$, a clause
$C \in N$ is relevant if for all deduction refutations $\pi$ of $N$ it holds that $C \in \pi$. A
clause $C \in N$ is semi-relevant if there exists a deduction refutation $\pi$ of $N$ in
which $C \in \pi$. A clause $C \in N$ is irrelevant if there is no deduction refutation $\pi$
of $N$ in which $C \in \pi$.


With respect to our example clause set N from Section 2 and its refutation, 
Figure 1, clause (5) is semi-relevant but not relevant, because the clauses (1), (2), 
(6), (9), (11) are already unsatisfiable. The clauses (1), (2), (6), (9), (11) are all 
relevant. 


Lemma 13 (Relevance). Given an unsatisfiable set of clauses $N$, the clause
$C \in N$ is relevant if and only if $N \setminus \{C\}$ is satisfiable.


Proof. Obvious: if N\ {C} is satisfiable there is no resolution refutation and since 
N is unsatisfiable C must occur in all refutations. If C occurs in all refutations 
there is no refutation without C so N \ {C} is satisfiable. 


Lemma 14 (Semi-Relevance Test). Given a set of clauses $N$ and a clause
$C \in N$, $C$ is semi-relevant if and only if $(N \setminus \{C\}, \{C\}) \Rightarrow^*_{\mathrm{RES}} (N \setminus \{C\}, S \cup \{\bot\})$.


Proof. If $(N \setminus \{C\}, \{C\}) \Rightarrow^*_{\mathrm{RES}} (N \setminus \{C\}, S \cup \{\bot\})$ then we have found a refutation
containing $C$. On the other hand, by Theorem 11, Lemma 7 and Corollary 5, if
there is a refutation containing $C$, then there is also an SOS refutation with SOS
$\{C\}$.
An immediate consequence of the above test and completeness of resolution
for first-order logic is the following corollary. 


Corollary 15 (Complexity of the Semi-Relevance Test). Testing semi- 
relevance in first-order logic is semi-decidable. It is decidable for all fragments 
where resolution constitutes a decision procedure. 


Fragments where our semi-relevance test is guaranteed to terminate are for 
example first-order fragments enjoying the bounded model property, such as the 
Bernays-Schoenfinkel fragment [3]. 


5 Conclusion 


We have extended and sharpened the original completeness result for SOS resolu- 
tion [18], Theorem 11. The generalized SOS completeness result can actually be 
used to effectively test clauses for semi-relevance in case resolution constitutes a 
decision procedure for the respective clause set. This is for example the case for all 
fragments enjoying the bounded model property, such as the Bernays-Schoenfinkel 
fragment [3]. In general, our approach yields a semi-decision procedure for semi- 
relevance. 

Our proof is based on deductions having an a priori tree structure. However, 
this is not a restriction in principle. It just simplifies the transformation introduced
in Section 2: renamings have only to be considered on input clauses. In a setting 
where proofs forming directed acyclic graphs are considered, renamings have to be 
carried all over a deduction, adding further technicalities to our transformation. 

It is well-known that changing the ordering of resolution steps in a resolution 
deduction may exponentially increase or exponentially decrease the length of the 
deduction. Therefore, our transformation of a deduction into an SOS deduction 
may also yield an exponential growth in the length of the deduction. It may also be 
the other way round if, e.g., subsumption is added to the transformation. It is also
not difficult to find examples where the transformation of Section 2 introduces 
redundant clauses. Recall that we have not made any assumption with respect to 
redundancy on deductions. So an open question is whether corresponding results 
hold on non-redundant deductions and what they actually mean for a respective 
notion of relevance. 

An open problem is the question whether a test for semi-relevance can be 
established with more restricted resolution calculi such as ordered resolution. In 
general, the SOS strategy is not complete with ordered resolution. However, it 
is complete with respect to a clause set saturated by ordered resolution. The 
technical obstacle here is that a saturated clause set may already contain the 
empty clause, because for our generalized completeness result and the respective 
relationship to semi-relevance, the set N may still be unsatisfiable without the 
clause C to be tested for semi-relevance. 
Abstract. AVATAR is an elegant and effective way to split clauses in a
saturation prover using a SAT solver. But is it refutationally complete? 
And how does it relate to other splitting architectures? To answer these 
questions, we present a unifying framework that extends a saturation 
calculus (e.g., superposition) with splitting and embeds the result in a 
prover guided by a SAT solver. The framework also allows us to study 
locking, a subsumption-like mechanism based on the current propositional 
model. Various architectures are instances of the framework, including 
AVATAR, labeled splitting, and SMT with quantifiers. 


1 Introduction 


One of the great strengths of saturation calculi such as superposition [1] is 
that they avoid case distinctions. Derived clauses hold unconditionally, and 
the prover can stop as soon as it derives the empty clause, without having to 
backtrack. The drawback is that these calculi often generate long, unwieldy 
clauses that slow down the prover. A remedy is to partition the search space by 
splitting a multiple-literal clause $C_1 \lor \cdots \lor C_n$ into variable-disjoint subclauses $C_i$.
Splitting approaches include splitting with backtracking [24], splitting without 
backtracking [20], labeled splitting [10], and AVATAR [22]. 

The SAT-based AVATAR architecture is of particular interest because it is 
so successful. Voronkov reported that an AVATAR-enabled Vampire could solve 
421 TPTP [21] problems that had never been solved before by any system [22, 
Sect. 9], a mind-boggling number. AVATAR works well in combination with 
the superposition calculus because it combines superposition’s strong equality 
reasoning with the SAT solver’s strong clausal reasoning. It is also appealing 
theoretically, because it gracefully generalizes traditional saturation provers and 
yet degenerates to a SAT solver if the problem is propositional. 


Example 1. To illustrate the approach, we follow the key steps of an AVATAR-enabled
resolution prover on the initial clause set containing $\neg p(a)$, $\neg q(z, z)$, and
$p(x) \lor q(y, b)$. The disjunction can be split into $p(x) \leftarrow \{[p(x)]\}$ and $q(y, b) \leftarrow
\{[q(y, b)]\}$, where $C \leftarrow \{[C]\}$ indicates that the clause $C$ is enabled only in models
in which the associated propositional variable $[C]$ is true. A SAT solver is then
run to choose a model $\mathcal{J}$ of $[p(x)] \lor [q(y, b)]$. Suppose $\mathcal{J}$ makes $[p(x)]$ true and
$[q(y, b)]$ false. Then resolving $p(x) \leftarrow \{[p(x)]\}$ with $\neg p(a)$ produces $\bot \leftarrow \{[p(x)]\}$,
which closes the branch. Next, the SAT solver makes the right disjunct true,
and resolving $q(y, b) \leftarrow \{[q(y, b)]\}$ with $\neg q(z, z)$ yields $\bot \leftarrow \{[q(y, b)]\}$. The SAT
solver then reports “unsatisfiable,” concluding the refutation.


What about refutational completeness? Far from being a purely theoretical 
concern, establishing completeness—or finding counterexamples—could yield 
insights and perhaps lead to an even stronger AVATAR. Before we can answer 
this open question, we must mathematize splitting. Our starting point is the 
saturation framework by Waldmann, Tourret, Robillard, and Blanchette [23], 
based on Bachmair and Ganzinger [2]. It covers a wide array of techniques, 
but “the main missing piece of the framework is a generic treatment of clause 
splitting” [23, p. 332]. We provide that missing piece, in the form of a splitting 
framework, and use it to show the completeness of an AVATAR-like architecture. 

Our framework has five layers, linked by refinement. The first layer consists 
of a refutationally complete base calculus, such as resolution or superposition. It 
must be presentable as an inference system and a redundancy criterion. 

From a base calculus, we derive a splitting calculus (Sect. 3). This extends 
the base calculus with splitting and inherits the base’s completeness. It works on 
A-clauses or A-formulas $C \leftarrow A$, where $A$ is a set of propositional literals.

Using the saturation framework, we can prove the dynamic completeness of an 
abstract prover, formulated as a transition system, that implements the splitting 
calculus. However, this ignores a vital component of AVATAR: the SAT solver. 
AVATAR considers only inferences involving A-formulas whose assertions are 
true in the current propositional model. The role of the third layer is to reflect 
this behavior. A model-guided prover operates on states of the form $(\mathcal{J}, \mathcal{N})$, where
$\mathcal{J}$ is a propositional model and $\mathcal{N}$ is a set of A-formulas (Sect. 4).

The fourth layer introduces AVATAR’s locking mechanism (Sect. 5). With 
locking, an A-formula $D \leftarrow B$ can be temporarily disabled by another A-formula
$C \leftarrow A$ if $C$ subsumes $D$, even if $A \not\subseteq B$. Here we make a first discovery:
AVATAR-style locking compromises completeness and must be curtailed.

Finally, the fifth layer is an AVATAR-based prover (Sect. 6). This refines the 
locking model-guided prover of the fourth layer with the given clause procedure, 
which saturates an A-formula set by distinguishing between active and passive 
A-formulas. Here we make another discovery: Selecting A-formulas fairly is not 
enough to guarantee completeness. We need a stronger criterion. 

In a hypothetical tête-à-tête with the designers of labeled splitting, they might 
gently point out that by pioneering the use of a propositional model, including 
locking, they almost invented AVATAR themselves. Likewise, developers of 
SMT solvers might be tempted to claim that Voronkov merely reinvented SMT. 
To investigate such questions, we apply our framework to splitting without 
backtracking, labeled splitting, and SMT with quantifiers (Sect. 7). This gives us 
a solid basis for comparison as well as some new theoretical results. 

A technical report [8] is available with the proofs, several counterexamples, 
and further details. A formalization using Isabelle/HOL [16] is underway. 
2 Preliminaries


Our framework is parameterized by abstract notions of formulas, consequence rela- 
tions, inferences, and redundancy. We largely follow the conventions of Waldmann 
et al. [23]. A-formulas generalize Voronkov’s A-clauses [22]. 


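Formulas. A set $\mathbf{F}$ of formulas is a set that contains a distinguished element $\bot$
denoting falsehood. A consequence relation $\models\ \subseteq (\mathcal{P}(\mathbf{F}))^2$ has the following
properties for all $M, N, P, Q \subseteq \mathbf{F}$ and $C, D \in \mathbf{F}$: (D1) $\{\bot\} \models \emptyset$; (D2) $\{C\} \models \{C\}$;
(D3) if $M \subseteq N$ and $P \subseteq Q$, then $M \models P$ implies $N \models Q$; (D4) if $M \models P$ and
$N \models Q \cup \{C\}$ for every $C \in M$ and $N \cup \{D\} \models Q$ for every $D \in P$, then $N \models Q$.
The intended meaning of $M \models N$ is $\bigwedge M \rightarrow \bigvee N$. From $\models$, we can easily derive
a relation understood as $\bigwedge M \rightarrow \bigwedge N$, as required by the saturation framework.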

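The $\models$ notation can be extended to allow negation on either side. Let $\mathbf{F}_\sim$
be defined as $\mathbf{F} \uplus \{\sim C \mid C \in \mathbf{F}\}$ such that $\sim\sim C = C$. Given $M, N \subseteq \mathbf{F}_\sim$, we
have $M \models N$ if and only if $\{C \in \mathbf{F} \mid C \in M\} \cup \{C \in \mathbf{F} \mid \sim C \in N\} \models \{C \in \mathbf{F} \mid
\sim C \in M\} \cup \{C \in \mathbf{F} \mid C \in N\}$.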

Following the saturation framework [23, p. 318], we distinguish between the
consequence relation $\models$ used for stating refutational completeness and a possibly
stronger consequence relation $\mathrel{|\!\approx}$ for soundness. We require that $\mathrel{|\!\approx}$ is compact.


Example 2. In clausal first-order logic with equality, the formulas in $\mathbf{F}$ consist of
clauses over a signature $\Sigma$. Each clause $C$ is a finite multiset of literals $L_1, \ldots, L_n$
written $C = L_1 \lor \cdots \lor L_n$. Each literal $L$ is either an atom or its negation ($\neg$),
and each atom is an unoriented equation $s \approx t$. We have $M \models N$ if and only if
every $\Sigma$-model of $M$ also satisfies at least one clause in $N$.


Calculi and Derivations. A refutational calculus $(\mathit{Inf}, \mathit{Red})$ combines a set of
inferences $\mathit{Inf}$ and a redundancy criterion $\mathit{Red}$. We refer to Waldmann et al. [23]
for the precise definitions. Recall in particular that $\mathit{Inf}(N)$ is the set of inferences
from $N$, $\mathit{Inf}(N, M) = \mathit{Inf}(N \cup M) \setminus \mathit{Inf}(N \setminus M)$, $N$ is saturated w.r.t. $\mathit{Inf}$ and
$\mathit{Red}_I$ if $\mathit{Inf}(N) \subseteq \mathit{Red}_I(N)$, and $(\mathit{Inf}, \mathit{Red})$ is statically (refutationally) complete
(w.r.t. $\models$) if $\bot \in N$ for every $N \models \{\bot\}$ saturated w.r.t. $\mathit{Inf}$ and $\mathit{Red}_I$.

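Let $(X_i)_i$ be a sequence of sets. Its limit inferior is $X_\infty = \liminf_{j \to \infty} X_j =
\bigcup_i \bigcap_{j \geq i} X_j$, and its limit superior is $X^\infty = \limsup_{j \to \infty} X_j = \bigcap_i \bigcup_{j \geq i} X_j$. The
elements of $X_\infty$ are called persistent. A sequence $(N_i)_i$ over $\mathcal{P}(\mathbf{F})$ is weakly fair
w.r.t. $\mathit{Inf}$ and $\mathit{Red}_I$ if $\mathit{Inf}(N_\infty) \subseteq \bigcup_i \mathit{Red}_I(N_i)$ and strongly fair if $\limsup_{i \to \infty} \mathit{Inf}(N_i) \subseteq
\bigcup_i \mathit{Red}_I(N_i)$. Given a relation $\rhd$, a $\rhd$-derivation is an infinite sequence such that
$x_i \rhd x_{i+1}$ for every $i$. Finite runs can be extended to derivations via stuttering.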
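
To make the two limits concrete, here is a small sketch for the special case of an eventually periodic sequence, i.e., a finite prefix followed by a finite cycle that repeats forever. The function name and the prefix/cycle representation are assumptions made for the illustration. In this case the limit inferior is the intersection of the cycle and the limit superior is its union; a finite run extended by stuttering corresponds to a cycle of length one.

```python
from typing import FrozenSet, Hashable, Iterable, Sequence, Tuple


def limits_eventually_periodic(
    prefix: Sequence[Iterable[Hashable]],
    cycle: Sequence[Iterable[Hashable]],
) -> Tuple[FrozenSet[Hashable], FrozenSet[Hashable]]:
    """Return (liminf, limsup) of the sequence prefix + cycle + cycle + ...

    The finite prefix does not influence either limit; it is accepted only
    so that the call mirrors the shape of the sequence.
    """
    if not cycle:
        raise ValueError("cycle must be non-empty")
    sets = [frozenset(x) for x in cycle]
    lim_inf = frozenset.intersection(*sets)  # elements present from some point on
    lim_sup = frozenset.union(*sets)         # elements occurring infinitely often
    return lim_inf, lim_sup


# A sequence that alternates forever between {2, 3} and {2, 4} after {1, 2}.
persistent, recurring = limits_eventually_periodic([{1, 2}], [{2, 3}, {2, 4}])

# A finite run [{1}, {5}] extended by stuttering on its last element.
stutter_inf, stutter_sup = limits_eventually_periodic([{1}], [{5}])
```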

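Let $\rhd_{\mathit{Red}_F} \subseteq (\mathcal{P}(\mathbf{F}))^2$ be the relation such that $M \rhd_{\mathit{Red}_F} N$ if and only
if $M \setminus N \subseteq \mathit{Red}_F(N)$. The calculus $(\mathit{Inf}, \mathit{Red})$ is dynamically (refutationally)
complete (w.r.t. $\models$) if for every $\rhd_{\mathit{Red}_F}$-derivation $(N_i)_i$ that is weakly fair w.r.t.
$\mathit{Inf}$ and $\mathit{Red}_I$ and such that $N_0 \models \{\bot\}$, we have $\bot \in N_i$ for some $i$.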


A-Formulas. We fix throughout a countable set $\mathbf{V}$ of propositional variables
$v_0, v_1, \ldots$. For each $v \in \mathbf{V}$, let $\neg v \in \neg\mathbf{V}$ denote its negation, with $\neg\neg v = v$. We
assume that a formula $\mathit{fml}(v) \in \mathbf{F}$ is associated with each $v \in \mathbf{V}$. Intuitively, $v$
approximates $\mathit{fml}(v)$ at the propositional level. This definition is extended so
that $\mathit{fml}(\neg v) = {\sim}\mathit{fml}(v)$. An assertion $a \in \mathbf{A} = \mathbf{V} \cup \neg\mathbf{V}$ is either a propositional
variable $v$ or its negation $\neg v$. Given a formula $C \in \mathbf{F}_\sim$, let $\mathit{asn}(C)$ denote the set
of assertions $a \in \mathbf{A}$ such that $\{\mathit{fml}(a)\} \mathrel{|\!\approx} \{C\}$ and $\{C\} \mathrel{|\!\approx} \{\mathit{fml}(a)\}$.

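A propositional interpretation $\mathcal{J} \subseteq \mathbf{A}$ is a set such that for every $v \in \mathbf{V}$,
exactly one of $v \in \mathcal{J}$ and $\neg v \in \mathcal{J}$ holds. We reserve the letter $\mathcal{J}$ for interpretations,
and define $\mathit{fml}(\mathcal{J}) = \{\mathit{fml}(a) \mid a \in \mathcal{J}\}$.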

An A-formula over a set $\mathbf{F}$ of base formulas and an assertion set $\mathbf{A}$ is a pair
$\mathcal{C} = (C, A) \in \mathbf{AF} = \mathbf{F} \times \mathcal{P}_{\mathrm{fin}}(\mathbf{A})$, written $C \leftarrow A$, where $C$ is a formula and $A$ is a
finite set of assertions $\{a_1, \ldots, a_n\}$ understood as an implication $a_1 \land \cdots \land a_n \rightarrow C$.
We identify $C \leftarrow \emptyset$ with $C$ and define the projection $\lfloor C \leftarrow A \rfloor = C$. Moreover,
$\mathcal{N}_\bot$ is the set consisting of all A-formulas of the form $\bot \leftarrow A \in \mathcal{N}$. We call such
A-formulas propositional clauses. Note the use of calligraphic letters (e.g., $\mathcal{C}$, $\mathcal{N}$)
to range over A-formulas and sets of A-formulas.

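We say that $C \leftarrow A \in \mathbf{AF}$ is enabled in $\mathcal{J}$ if $A \subseteq \mathcal{J}$. A set of A-formulas is enabled
in $\mathcal{J}$ if all of its members are enabled in $\mathcal{J}$. The enabled projection $\mathcal{N}_\mathcal{J} \subseteq \lfloor\mathcal{N}\rfloor$
consists of the projections $\lfloor\mathcal{C}\rfloor$ of all A-formulas $\mathcal{C} \in \mathcal{N}$ enabled in $\mathcal{J}$. Analogously,
the enabled projection $\mathit{Inf}_\mathcal{J} \subseteq \lfloor\mathit{Inf}\rfloor$ of a set $\mathit{Inf}$ of $\mathbf{AF}$-inferences consists of the
projections $\lfloor\iota\rfloor$ of all inferences $\iota \in \mathit{Inf}$ whose premises are all enabled in $\mathcal{J}$.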
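
The notions of A-formula, enabledness, and enabled projection translate directly into a small data structure. The sketch below is illustrative only: formulas are kept as opaque strings, assertions are encoded as strings such as `"v0"` and `"-v0"`, and the names `AFormula`, `is_enabled`, and `enabled_projection` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class AFormula:
    """A pair C <- A of a base formula and a finite set of assertions."""
    formula: str
    assertions: FrozenSet[str] = frozenset()


def is_enabled(af: AFormula, interpretation: FrozenSet[str]) -> bool:
    """C <- A is enabled in J if A is a subset of J."""
    return af.assertions <= interpretation


def enabled_projection(afs: Iterable[AFormula],
                       interpretation: FrozenSet[str]) -> Set[str]:
    """Projections of all A-formulas that are enabled in the interpretation."""
    return {af.formula for af in afs if is_enabled(af, interpretation)}


# A toy A-formula set; "FALSE" stands for the falsehood formula.
afs = [
    AFormula("-p(a)"),
    AFormula("-q(z,z)"),
    AFormula("FALSE", frozenset({"-v0", "-v1"})),
    AFormula("p(x)", frozenset({"v0"})),
    AFormula("q(y,b)", frozenset({"v1"})),
]
j = frozenset({"v0", "-v1"})
projection = enabled_projection(afs, j)
```

A-formulas with an empty assertion set are enabled in every interpretation, which matches the identification of $C \leftarrow \emptyset$ with $C$.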

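A propositional interpretation $\mathcal{J}$ is a propositional model of $\mathcal{N}_\bot$, written
$\mathcal{J} \models \mathcal{N}_\bot$, if $\bot \notin (\mathcal{N}_\bot)_\mathcal{J}$. Moreover, we write $\mathcal{J} \mathrel{|\!\approx} \mathcal{N}_\bot$ if $\bot \notin (\mathcal{N}_\bot)_\mathcal{J}$ or $\mathit{fml}(\mathcal{J}) \mathrel{|\!\approx} \{\bot\}$.
A set $\mathcal{N}_\bot$ is propositionally satisfiable if there exists an interpretation $\mathcal{J}$ such
that $\mathcal{J} \models \mathcal{N}_\bot$. In contrast to consequence relations, propositional modelhood $\models$
interprets the set $\mathcal{N}_\bot$ conjunctively: $\mathcal{J} \models \mathcal{N}_\bot$ is understood as $\mathcal{J} \models \bigwedge \mathcal{N}_\bot$.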

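Finally, we lift $\models$ and $\mathrel{|\!\approx}$ from $\mathcal{P}(\mathbf{F})$ to $\mathcal{P}(\mathbf{AF})$: $\mathcal{M} \models \mathcal{N}$ if and only if
$\mathcal{M}_\mathcal{J} \models \lfloor\mathcal{N}\rfloor$ for every $\mathcal{J}$ in which $\mathcal{N}$ is enabled, and $\mathcal{M} \mathrel{|\!\approx} \mathcal{N}$ if and only if
$\mathit{fml}(\mathcal{J}) \cup \mathcal{M}_\mathcal{J} \mathrel{|\!\approx} \lfloor\mathcal{N}\rfloor$ for every $\mathcal{J}$ in which $\mathcal{N}$ is enabled.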


Example 3. In the original AVATAR [22], the connection between first-order
clauses and assertions takes the form of a function $[\,\cdot\,] : \mathbf{F} \rightarrow \mathbf{A}$. The encoding is
such that $[\neg C] = \neg[C]$ for every ground unit clause $C$ and $[C] = [D]$ if and only if
$C$ is syntactically equal to $D$ up to variable renaming. This can be supported in
our framework by letting $\mathit{fml}(v) = C$ for some $C$ such that $[C] = v$, for every $v$.


3 Splitting Calculi 


Let $\mathbf{F}$ be a set of base formulas equipped with $\bot$, $\models$, and $\mathrel{|\!\approx}$. The relation $\mathrel{|\!\approx}$ is
assumed to be nontrivial: (D5) $\emptyset \mathrel{|\!\not\approx} \emptyset$. Let $\mathbf{A}$ be a set of assertions over $\mathbf{V}$ and $\mathbf{AF}$
be the set of A-formulas over $\mathbf{F}$ and $\mathbf{A}$. Let $(\mathit{FInf}, \mathit{FRed})$ be a base calculus for $\mathbf{F}$,
where $\mathit{FRed}$ is a redundancy criterion that additionally satisfies (1) an inference
is $\mathit{FRed}_I$-redundant if one of its premises is $\mathit{FRed}_F$-redundant; (2) $\bot \notin \mathit{FRed}_F(N)$
for every $N \subseteq \mathbf{F}$; and (3) $C \in \mathit{FRed}_F(\{\bot\})$ for every $C \neq \bot$. These requirements
can easily be met by a well-designed redundancy criterion [1, Sect. 4.3].

Below, we will define the splitting calculus induced by the base calculus. We
will see that it not only is statically and dynamically complete w.r.t. $\models$, but also
meets stronger, “local completeness” criteria that capture model switching.
The Inference Rules. We start with the mandatory inference rules.

Definition 4. The splitting inference system $\mathit{SInf}$ consists of all instances of

\[
\frac{(C_i \leftarrow A_i)_{i=1}^{n}}{D \leftarrow A_1 \cup \cdots \cup A_n}\ \text{BASE}
\qquad
\frac{(\bot \leftarrow A_i)_{i=1}^{n}}{\bot}\ \text{UNSAT}
\]

For BASE, the side condition is $(C_n, \ldots, C_1, D) \in \mathit{FInf}$. For UNSAT, the side
condition is that $\{\bot \leftarrow A_1, \ldots, \bot \leftarrow A_n\}$ is propositionally unsatisfiable.


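In addition, the following optional inference rules can be used:

\[
\frac{C \leftarrow A}{\overline{\;\bot \leftarrow \{\neg a_1, \ldots, \neg a_n\} \cup A \qquad (C_i \leftarrow \{a_i\})_{i=1}^{n}\;}}\ \text{SPLIT}
\]

\[
\frac{(\bot \leftarrow A_i)_{i=1}^{n} \qquad C \leftarrow A}{\overline{\;(\bot \leftarrow A_i)_{i=1}^{n}\;}}\ \text{COLLECT}
\qquad
\frac{(\bot \leftarrow A_i)_{i=1}^{n} \qquad C \leftarrow A \cup B}{\overline{\;(\bot \leftarrow A_i)_{i=1}^{n} \qquad C \leftarrow B\;}}\ \text{TRIM}
\]

\[
\frac{(\bot \leftarrow A_i)_{i=1}^{n}}{\bot}\ \text{STRONGUNSAT}
\qquad
\frac{C \leftarrow A}{\bot \leftarrow \{\neg a\} \cup A}\ \text{APPROX}
\qquad
\frac{}{C \leftarrow A}\ \text{TAUTO}
\]

The following side conditions apply. For SPLIT: $C \neq \bot$ is splittable into $C_1, \ldots,
C_n$ and $a_i \in \mathit{asn}(C_i)$ for each $i$. A formula $C$ is splittable into two or more
formulas $C_1, \ldots, C_n$ if $\{C\} \mathrel{|\!\approx} \{C_1, \ldots, C_n\}$ and $C \in \mathit{FRed}_F(\{C_i\})$ for each $i$.
For COLLECT: $C \neq \bot$ and $\{\bot \leftarrow A_i\}_i \mathrel{|\!\approx} \{\bot \leftarrow A\}$. For TRIM: $C \neq \bot$ and
$\{\bot \leftarrow A_i\}_i \cup \{\bot \leftarrow A\} \mathrel{|\!\approx} \{\bot \leftarrow B\}$. For STRONGUNSAT: $\{\bot \leftarrow A_i\}_i \mathrel{|\!\approx} \{\bot\}$.
For APPROX: $a \in \mathit{asn}(C)$. For TAUTO: $\mathrel{|\!\approx} \{C \leftarrow A\}$.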

The three rules identified by double bars are simplifications; they replace
their premises with their conclusions in the current A-formula set. The premises’
removal is justified by $\mathit{SRed}_F$, defined below. Also note that BASE preserves the
soundness of $\mathit{FInf}$ w.r.t. $\mathrel{|\!\approx}$ and that the other rules are sound w.r.t. $\mathrel{|\!\approx}$.

The SPLIT rule performs an $n$-way case split on $C$. Each case $C_i$ is approximated
by an assertion $a_i$. The first conclusion expresses that the case distinction
is exhaustive. The $n$ other conclusions assume $C_i$ if its approximation $a_i$ is true.
In a clausal prover, typically $C = C_1 \lor \cdots \lor C_n$, where the subclauses $C_i$ have
mutually disjoint sets of variables and form a maximal split.

COLLECT and TRIM do some garbage collection. STRONGUNSAT is a variant
of UNSAT that uses $\mathrel{|\!\approx}$ instead of $\models$. It might correspond to invoking an SMT
solver [3] ($\mathrel{|\!\approx}$) with a time limit, falling back on a SAT solver ($\models$). APPROX can be
used to make any derived A-formula visible to $\mathrel{|\!\approx}$. TAUTO allows communication
in the other direction, from the SAT solver to the calculus.


Example 5. Suppose the base calculus is first-order resolution [2] and the initial
clauses are $\neg p(a)$, $\neg q(z, z)$, and $p(x) \lor q(y, b)$, as in Example 1. SPLIT replaces
the last clause by $\bot \leftarrow \{\neg v_0, \neg v_1\}$, $p(x) \leftarrow \{v_0\}$, and $q(y, b) \leftarrow \{v_1\}$. Two BASE
inferences then generate $\bot \leftarrow \{v_0\}$ and $\bot \leftarrow \{v_1\}$. Finally, UNSAT generates $\bot$.
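
The side condition of UNSAT in this example can be checked mechanically. The sketch below is a brute-force illustration, not how a SAT solver would do it: a propositional clause is represented only by its assertion set, assertions are strings such as `"v0"` and `"-v0"`, and the function names are hypothetical. An interpretation is a model of a set of propositional clauses exactly when none of the assertion sets is contained in it.

```python
from itertools import product
from typing import FrozenSet, Iterable, Iterator, List, Sequence


def interpretations(variables: Sequence[str]) -> Iterator[FrozenSet[str]]:
    """Enumerate all interpretations: each variable is either v or -v."""
    for bits in product([False, True], repeat=len(variables)):
        yield frozenset(v if b else "-" + v for v, b in zip(variables, bits))


def propositionally_unsatisfiable(assertion_sets: Iterable[FrozenSet[str]],
                                  variables: Sequence[str]) -> bool:
    """True if every interpretation enables at least one clause FALSE <- A."""
    sets: List[FrozenSet[str]] = list(assertion_sets)
    return not any(all(not a <= j for a in sets)
                   for j in interpretations(variables))


variables = ["v0", "v1"]
after_split = [frozenset({"-v0", "-v1"})]
after_base = after_split + [frozenset({"v0"}), frozenset({"v1"})]

unsat_after_split = propositionally_unsatisfiable(after_split, variables)
unsat_after_base = propositionally_unsatisfiable(after_base, variables)
```

Directly after SPLIT the propositional clauses still have models, so UNSAT is not applicable; it becomes applicable once both BASE conclusions are present.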
The Redundancy Criterion. Next, we lift the base redundancy criterion.


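Definition 6. The splitting redundancy criterion $\mathit{SRed} = (\mathit{SRed}_I, \mathit{SRed}_F)$ is
specified as follows. An A-formula $C \leftarrow A \in \mathbf{AF}$ is redundant w.r.t. $\mathcal{N}$, written
$C \leftarrow A \in \mathit{SRed}_F(\mathcal{N})$, if (1) $C \in \mathit{FRed}_F(\mathcal{N}_\mathcal{J})$ for every propositional interpretation
$\mathcal{J} \supseteq A$ or (2) there exists an A-formula $C \leftarrow B \in \mathcal{N}$ with $B \subset A$. An inference
$\iota \in \mathit{SInf}$ is redundant w.r.t. $\mathcal{N}$, written $\iota \in \mathit{SRed}_I(\mathcal{N})$, if (1) $\iota$ is a BASE inference
and $\{\iota\}_\mathcal{J} \subseteq \mathit{FRed}_I(\mathcal{N}_\mathcal{J})$ for every $\mathcal{J}$ or (2) $\iota$ is an UNSAT inference and $\bot \in \mathcal{N}$.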


$\mathit{SRed}$ qualifies as a redundancy criterion. It can justify the deletion of A-formulas
that are propositionally tautological. It also allows other simplifications,
as long as the assertions on A-formulas used to simplify a given $C \leftarrow A$ are
contained in $A$. If the base criterion $\mathit{FRed}_F$ supports subsumption, this also
extends to A-formulas: $D \leftarrow B \in \mathit{SRed}_F(\{C \leftarrow A\})$ if $D$ is strictly subsumed by $C$
and $B \supseteq A$, or if $C = D$ and $B \supset A$.


Local Saturation. It is not difficult to show that if $(\mathit{FInf}, \mathit{FRed})$ is statically
complete, then $(\mathit{SInf}, \mathit{SRed})$ is statically and hence dynamically complete. However,
this result fails to capture a key aspect of most splitting architectures.
Since $\rhd_{\mathit{SRed}_F}$-derivations have no notion of current split branch or model $\mathcal{J}$, they
must also perform disabled inferences. To respect enabledness, we need a weaker
notion of saturation. If an A-formula set is consistent, it should suffice to saturate
w.r.t. a single propositional model. In other words, if no A-formula $\bot \leftarrow A$ with $A \subseteq \mathcal{J}$ is
derivable for some model $\mathcal{J} \models \mathcal{N}_\bot$, the prover should be allowed to give a verdict
of “consistent.” We will call such model-specific saturations local.


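Definition 7. A set $\mathcal{N} \subseteq \mathbf{AF}$ is locally saturated w.r.t. $\mathit{SInf}$ and $\mathit{SRed}_I$ if either
$\bot \in \mathcal{N}$ or there exists $\mathcal{J} \models \mathcal{N}_\bot$ such that $\mathcal{N}_\mathcal{J}$ is saturated w.r.t. $\mathit{FInf}$ and $\mathit{FRed}_I$.


Theorem 8 (Strong static completeness). Assume $(\mathit{FInf}, \mathit{FRed})$ is statically
complete. Given a set $\mathcal{N} \subseteq \mathbf{AF}$ that is locally saturated w.r.t. $\mathit{SInf}$ and
$\mathit{SRed}_I$ and such that $\mathcal{N} \models \{\bot\}$, we have $\bot \in \mathcal{N}$.


Example 9. Consider the A-clause set $\{\bot \leftarrow \{\neg[p(x)], \neg[q(y)]\},\ p(x) \leftarrow \{[p(x)]\},\
q(y) \leftarrow \{[q(y)]\},\ \neg q(a)\}$ expressed using AVATAR conventions. It is not saturated
for resolution, because the conclusion $\bot \leftarrow \{[q(y)]\}$ of resolving the last two
A-clauses is missing, but it is locally saturated with $\mathcal{J} \supseteq \{[p(x)], \neg[q(y)]\}$.


Definition 10. A sequence $(\mathcal{N}_i)_i$ of sets of A-formulas is locally fair w.r.t. $\mathit{SInf}$
and $\mathit{SRed}_I$ if either $\bot \in \mathcal{N}_i$ for some $i$ or there exists $\mathcal{J} \models (\mathcal{N}_\infty)_\bot$ such that
$\mathit{FInf}((\mathcal{N}_\infty)_\mathcal{J}) \subseteq \bigcup_i \mathit{FRed}_I((\mathcal{N}_i)_\mathcal{J})$.


Theorem 11 (Strong dynamic completeness). Assume $(\mathit{FInf}, \mathit{FRed})$ is
statically complete. Given a $\rhd_{\mathit{SRed}_F}$-derivation $(\mathcal{N}_i)_i$ that is locally fair w.r.t.
$\mathit{SInf}$ and $\mathit{SRed}_I$ and such that $\mathcal{N}_0 \models \{\bot\}$, we have $\bot \in \mathcal{N}_i$ for some $i$.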


In Sects. 4 to 6, we will review three transition systems of increasing complexity,
culminating with an idealized specification of AVATAR. They will be linked by a
chain of stepwise refinements, like pearls on a string. All derivations using these
will correspond to $\rhd_{\mathit{SRed}_F}$-derivations, and their fairness criteria will imply local
fairness. Consequently, by Theorem 11, they will all be complete.
4 Model-Guided Provers


AVATAR and other splitting architectures maintain a model of the propositional
clauses, which represents the split tree’s current branch. We can capture this
abstractly by refining $\rhd_{\mathit{SRed}_F}$-derivations to incorporate a propositional model.

The states are now pairs $(\mathcal{J}, \mathcal{N})$, where $\mathcal{J}$ is a propositional model and $\mathcal{N} \subseteq \mathbf{AF}$.
Initial states have the form $(\mathcal{J}, N)$, where $N \subseteq \mathbf{F}$. The model-guided prover MG
is defined by the following transition rules:


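DERIVE       $(\mathcal{J}, \mathcal{N} \uplus \mathcal{M}) \Longrightarrow_{\mathsf{MG}} (\mathcal{J}, \mathcal{N} \cup \mathcal{M}')$  if $\mathcal{M} \subseteq \mathit{SRed}_F(\mathcal{N} \cup \mathcal{M}')$
SWITCH       $(\mathcal{J}, \mathcal{N}) \Longrightarrow_{\mathsf{MG}} (\mathcal{J}', \mathcal{N})$  if $\mathcal{J}' \models \mathcal{N}_\bot$
STRONGUNSAT  $(\mathcal{J}, \mathcal{N}) \Longrightarrow_{\mathsf{MG}} (\mathcal{J}, \mathcal{N} \cup \{\bot\})$  if $\mathcal{N}_\bot \mathrel{|\!\approx} \{\bot\}$

From a $\Longrightarrow_{\mathsf{MG}}$-derivation, we obtain a $\rhd_{\mathit{SRed}_F}$-derivation by simply erasing
the $\mathcal{J}$ components. The DERIVE rule can add new A-formulas and delete redundant
A-formulas. $\mathcal{J}$ should be a model of $\mathcal{N}_\bot$ most of the time; when it is not, SWITCH
can be used to switch model or STRONGUNSAT to finish the refutation.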


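Example 12. Let us revisit Example 5. Initially, let $\mathcal{J}_0 = \{\neg v_0, \neg v_1\}$. After the
split, we have $\neg p(a)$, $\neg q(z, z)$, $p(x) \leftarrow \{v_0\}$, $q(y, b) \leftarrow \{v_1\}$, and $\bot \leftarrow \{\neg v_0, \neg v_1\}$.
The natural option is to switch model. We take $\mathcal{J}_1 = \{v_0, \neg v_1\}$. We then derive
$\bot \leftarrow \{v_0\}$. Since $\mathcal{J}_1 \not\models \bot \leftarrow \{v_0\}$, we switch to $\mathcal{J}_2 = \{\neg v_0, v_1\}$, where we derive
$\bot \leftarrow \{v_1\}$. Finally, we detect that the propositional clauses are unsatisfiable.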


We need a fairness criterion for MG that implies local fairness of the underlying
$\rhd_{\mathit{SRed}_F}$-derivation. The latter requires a witness $\mathcal{J}$ but gives us no hint as to
where to look for one. Our solution involves a topological concept: $\mathcal{J}$ is a limit
point in $(\mathcal{J}_i)_i$ if there exists a subsequence $(\mathcal{J}'_i)_i$ of $(\mathcal{J}_i)_i$ such that $\mathcal{J} = \mathcal{J}'_\infty = \mathcal{J}'^\infty$.


Example 13. Let $(\mathcal{J}_i)_i$ be the sequence such that $\mathcal{J}_{2i} \cap \mathbf{V} = \{v_1, v_3, \ldots, v_{2i-1}\}$
(i.e., $v_1, v_3, \ldots, v_{2i-1}$ are true and the other variables are false) and $\mathcal{J}_{2i+1} =
(\mathcal{J}_{2i} \setminus \{\neg v_{2i}\}) \cup \{v_{2i}\}$. Although it is not in the sequence, the interpretation $\mathcal{J} \cap \mathbf{V} =
\{v_1, v_3, \ldots\}$ is a limit point. The associated split tree is shown in Fig. 1. The
direct path from the root to a node $\mathcal{J}_i$ specifies the assertions that are true in $\mathcal{J}_i$.
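
The sequence of Example 13 is easy to generate, which makes the limit point visible on a finite prefix. The sketch below represents each interpretation by the set of indices of its true variables; the function name `true_vars` and this encoding are assumptions made for the illustration. A finite prefix can of course only suggest, not prove, the limit behaviour.

```python
from typing import FrozenSet


def true_vars(n: int) -> FrozenSet[int]:
    """Indices of the variables that are true in the n-th interpretation.

    For n = 2i the true variables are v_1, v_3, ..., v_{2i-1};
    for n = 2i + 1 the variable v_{2i} is made true in addition.
    """
    i = n // 2
    indices = {2 * k - 1 for k in range(1, i + 1)}
    if n % 2 == 1:
        indices.add(2 * i)
    return frozenset(indices)


prefix = [true_vars(n) for n in range(6)]

# Each odd-indexed variable stays true from some index on ...
v1_persistent = all(1 in true_vars(n) for n in range(2, 50))
# ... while each even-indexed variable is true at exactly one index.
v4_occurrences = sum(4 in true_vars(n) for n in range(50))
```

On the generated prefix, every odd-indexed variable is true from some point on and every even-indexed variable is true only once, consistent with the limit point in which exactly the odd-indexed variables are true.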


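Example 14. Let $(J_i)_i$ be such that $J_0 \cap \mathbf{V} = \emptyset$, $J_{4i+1} \cap \mathbf{V} = \{v_0\} \cup \{v_{4j+3} \mid j < i\}$,
$J_{4i+2} \cap \mathbf{V} = \{v_0, v_{4i+2}\} \cup \{v_{4j+3} \mid j < i\}$, $J_{4i+3} \cap \mathbf{V} = \{v_{4j+1} \mid j \leq i\}$, and
$J_{4i+4} \cap \mathbf{V} = \{v_{4j+1} \mid j \leq i\} \cup \{v_{4i+4}\}$. This sequence has two limit points:
$J' = \liminf_{i \to \infty} J_{4i+1}$ and $J'' = \liminf_{i \to \infty} J_{4i+3}$. The split tree is depicted in Fig. 2.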


Basic topology tells us that every sequence has a limit point. No matter how 
erratically the prover switches branches, it will fully explore at least one of them. 
It then suffices to perform the base FInf-inferences fairly in that branch:


Definition 15. An $\Longrightarrow_{MG}$-derivation $(J_i, N_i)_i$ is fair if either (1) $\bot \in N_i$ for
some $i$ or (2) $J_i \models (N_i)_\bot$ for infinitely many indices $i$ and there exists a limit
point $J$ of $(J_i)_i$ such that $FInf((N_\infty)_J) \subseteq \bigcup_i FRed_I((N_i)_J)$.


Fairness of an $\Longrightarrow_{MG}$-derivation implies local fairness of the underlying
$\rhd_{SRed_F}$-derivation. A well-behaved propositional solver, as in labeled splitting, always
gives rise to a single limit point $J_\infty$, which can be taken for $J$ in Definition 15.

Fig. 1: A split tree with a single infinite branch
Fig. 2: A split tree with two infinite branches


By contrast, an unconstrained solver, as supported by AVATAR, can produce 
multiple limit points. Then it is more challenging to ensure fairness. 


Example 16. Consider the consistent set consisting of $\lnot p(a)$, $p(a) \lor q(a)$, and
$\lnot q(y) \lor p(f(y)) \lor q(f(y))$. Splitting the second clause into $p(a)$ and $q(a)$ and
resolving $q(a)$ with the third clause yields $p(f(a)) \lor q(f(a))$. This process can be
iterated. Now suppose that $v_{2i}$ and $v_{2i+1}$ are associated with $p(f^i(a))$ and $q(f^i(a))$,
respectively. If we split every emerging $p(f^i(a)) \lor q(f^i(a))$ and the SAT solver always
makes $v_{2i}$ true first, we end up with the situation of Example 13 and Fig. 1. For
the limit point $J$, all FInf-inferences are performed. Thus, the derivation is fair.


Example 17. We build a clause set from two copies of Example 16, where each
clause $C$ from each copy $i \in \{1, 2\}$ is extended to $\lnot r_i \lor C$. We add the clause
$r_1 \lor r_2$ and split it as our first move. From there, each branch imitates Example 16.
A SAT solver might jump back and forth, as in Example 14 and Fig. 2. Even 
if A-clauses get disabled and re-enabled infinitely often, we must perform all 
nonredundant inferences in at least one of the two limit points (J’ or J”). 


5 Locking Provers 


Next, we refine the model-guided prover into a locking prover that temporarily 
locks away A-formulas that are redundant locally w.r.t. some J but not globally. 
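The states are triples $(J, N, \mathcal{L})$, with $\mathcal{L} \subseteq P_{fin}(\mathbf{A}) \times \mathbf{AF}$. Intuitively, $(B, C \leftarrow A) \in \mathcal{L}$
means that $C \leftarrow A$ is “locally redundant” in interpretations $J \supseteq B$. The function
$\|\cdot\|$ erases the locks: $\|\mathcal{L}\| = \{C \mid (B, C) \in \mathcal{L}$ for some $B\}$. Initial states have the
form $(J, N, \emptyset)$, where $N \subseteq \mathbf{F}$. The locking prover L is defined by these two rules: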


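LIFT $(J, N, \mathcal{L}) \Longrightarrow_{L} (J', N' \cup \|U\|, \mathcal{L} \setminus U)$
    if $(J, N) \Longrightarrow_{MG} (J', N')$ and $U = \{(B, C \leftarrow A) \in \mathcal{L} \mid B \not\subseteq J'$ and $A \subseteq J'\}$
LOCK $(J, N \uplus \{C \leftarrow A\}, \mathcal{L}) \Longrightarrow_{L} (J, N, \mathcal{L} \cup \{(B, C \leftarrow A)\})$
    if $B \subseteq J$ and $C \in FRed_F(N_{J'})$ for all $J' \supseteq A \cup B$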


We note that $\Longrightarrow_{L}$-derivations refine $\Longrightarrow_{MG}$-derivations, with states $(J, N, \mathcal{L})$
mapped to $(J, N \cup \|\mathcal{L}\|)$.

Locking can cause incompleteness, because an A-formula can be locally
redundant at every point in the derivation and yet not be so at any limit point,
thereby breaking local saturation. For example, if we have derived $p(x) \leftarrow \{\lnot v_k\}$
for every $k$, then $p(c)$ is locally redundant in any $J$ that contains $\lnot v_k$. For
the models $J_i = \{v_1, \ldots, v_i, \lnot v_{i+1}, \ldots\}$, the clause $p(c)$ would always be locally
redundant and ignored. Yet $p(c)$ might not be locally redundant at the unique
limit point $J = \mathbf{V}$. We could rule out this counterexample by requiring that
derivations are strongly fair, that is, every inference possible infinitely often
must eventually be made redundant. However, we have found a counterexample
showing that strong fairness does not ensure completeness [8, Example 46]. It
would seem that this counterexample could arise with Vampire if the underlying
SAT solver produces this specific sequence of interpretations.

Our solution is as follows. Let $(J_i, N_i, \mathcal{L}_i)_i$ be an $\Longrightarrow_{L}$-derivation, let $(J'_j)_j$
be a subsequence of $(J_i)_i$, and let $(N'_j)_j$ be the corresponding subsequence of
$(N_i)_i$. To achieve fairness, we now consider $N'_\infty$, the A-formulas persistent in the
unlocked subsequence $(N'_j)_j$. By contrast, fairness of $\Longrightarrow_{MG}$-derivations used $N_\infty$.


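Definition 18. An $\Longrightarrow_{L}$-derivation $(J_i, N_i, \mathcal{L}_i)_i$ is fair if either (1) $\bot \in \bigcup_i N_i$ or
(2) $J_i \models (N_i)_\bot$ for infinitely many indices $i$ and there exists a subsequence $(J'_j)_j$
converging to a limit point $J$ such that
$FInf((N'_\infty)_J \cup ((\limsup_{j \to \infty} \|\mathcal{L}'_j\|)_J \setminus \|\mathcal{L}'^\infty\|_J)) \subseteq \bigcup_i FRed_I((N_i \cup \|\mathcal{L}_i\|)_J)$,
where $(N'_j)_j$ and $(\mathcal{L}'_j)_j$ correspond to $(J'_j)_j$.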


Fairness of an $\Longrightarrow_{L}$-derivation implies fairness of the corresponding
$\Longrightarrow_{MG}$-derivation. The condition on the sets $\mathcal{L}'_j$ ensures that inferences from A-formulas
that are locked infinitely often, but not infinitely often with the same lock, are
redundant at the limit point. In particular, if we know that each A-formula is
locked at most finitely often, then $\limsup_{j \to \infty} \|\mathcal{L}'_j\| = \|\mathcal{L}'^\infty\|$ and the inclusion
in the definition above simplifies to $FInf((N'_\infty)_J) \subseteq \bigcup_i FRed_I((N_i \cup \|\mathcal{L}_i\|)_J)$.


6 AVATAR-Based Provers 


AVATAR was unveiled in 2014 by Voronkov [22]. Since then, he and his colleagues 
studied many options and extensions [3,17]. A second implementation, in Lean’s 
super tactic, is due to Ebner [9]. Here we attempt to capture AVATAR’s essence. 

The abstract AVATAR-based prover we define in this section extends the 
locking prover L with a given clause procedure [13]. A-formulas are moved in 
turn from the passive to the active set, where inferences are performed. The 
heuristic for choosing the next given A-formula to move is guided by timestamps 
indicating when the A-formulas were derived, to ensure fairness. 

Let $\mathbf{TAF} = \mathbf{AF} \times \mathbb{N}$ be the set of timestamped A-formulas. Given $N \subseteq \mathbf{TAF}$,
we define $\lfloor N \rfloor = \{C \mid (C, t) \in N$ for some $t\}$, and we overload existing notations
to erase timestamps. Thus, |N] = N$], Ni = Nf, and so on. Note that 
we use a new set of calligraphic letters (e.g., C, N) to range over timestamped 
A-formulas and A-formulas sets. Using the saturation framework [23, Sect. 3], 
we lift (SInf, SRed) to a calculus (TSInf, TSRed) on $\mathbf{TAF}$ with the tiebreaker
order $>$ on timestamps, so that $(C, t + k) \in TSRed_F(\{(C, t)\})$ for any $k > 0$.

A state is a tuple $(J, \mathcal{A}, \mathcal{P}, \mathcal{Q}, \mathcal{L}) \in P(\mathbf{A}) \times P(\mathbf{TAF})^3 \times P(P_{fin}(\mathbf{A}) \times \mathbf{TAF})$,
where $\mathcal{A}$, $\mathcal{P}$, and $\mathcal{Q}$ are respectively the sets of active, passive, and other (disabled
or propositional) timestamped A-formulas, and $\mathcal{L}$ is the set of locked timestamped
A-formulas such that (1) $\mathcal{A}_\bot = \mathcal{P}_\bot = \emptyset$, (2) $\mathcal{A} \cup \mathcal{P}$ is enabled in $J$, and
(3) $\mathcal{Q}_J \subseteq \{\bot\}$. The AVATAR-based prover AV is defined as follows:


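INFER $(J, \mathcal{A}, \mathcal{P} \uplus \{\mathcal{C}\}, \mathcal{Q}, \mathcal{L}) \Longrightarrow_{AV} (J, \mathcal{A} \cup \{\mathcal{C}\}, \mathcal{P}', \mathcal{Q}', \mathcal{L})$
    if $TSInf(\mathcal{A}, \{\mathcal{C}\}) \subseteq TSRed_I(\mathcal{A} \cup \{\mathcal{C}\} \cup \mathcal{P}' \cup \mathcal{Q}')$, $\mathcal{P} \subseteq \mathcal{P}'$,
    and $\mathcal{Q} \subseteq \mathcal{Q}'$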
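PROCESS $(J, \mathcal{A}, \mathcal{P}, \mathcal{Q}, \mathcal{L}) \Longrightarrow_{AV} (J, \mathcal{A}', \mathcal{P}', \mathcal{Q}', \mathcal{L})$
    if $\mathcal{A} \supseteq \mathcal{A}'$
    and $(\mathcal{A} \setminus \mathcal{A}') \cup (\mathcal{P} \setminus \mathcal{P}') \cup (\mathcal{Q} \setminus \mathcal{Q}') \subseteq TSRed_F(\mathcal{A}' \cup \mathcal{P}' \cup \mathcal{Q}')$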
SWITCH (J, A, P,2,£) =x (J,A, P U Ul, 9",£\ UW) 


if J H Q1, J H921, ÆA ={CEA|C is enabled in J’}, 
uU =4{(B,(C}-4A,t))E€£|BZJ and ACJ}, and 
AUPUQ=A'UP UD! 
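STRONGUNSAT $(J, \mathcal{A}, \mathcal{P}, \mathcal{Q}, \mathcal{L}) \Longrightarrow_{AV} (J, \mathcal{A}, \mathcal{P}, \mathcal{Q} \cup \{(\bot, t)\}, \mathcal{L})$ if $\mathcal{Q}_\bot \mathrel{|\!\approx} \{\bot\}$
LOCKA $(J, \mathcal{A} \uplus \{(C \leftarrow A, t)\}, \mathcal{P}, \mathcal{Q}, \mathcal{L}) \Longrightarrow_{AV}$
    $(J, \mathcal{A}, \mathcal{P}, \mathcal{Q}, \mathcal{L} \cup \{(B, (C \leftarrow A, t))\})$
    if $B \subseteq J$ and $C \in FRed_F((\mathcal{A} \cup \mathcal{P})_{J'})$ for every $J' \supseteq A \cup B$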


There is also a LOCKP rule that is identical to LOCKA except that it starts in
the state $(J, \mathcal{A}, \mathcal{P} \uplus \{(C \leftarrow A, t)\}, \mathcal{Q}, \mathcal{L})$. An AV-derivation is well timestamped if
every A-formula introduced by a rule is assigned a unique timestamp.

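Let $(J_i, \mathcal{A}_i, \mathcal{P}_i, \mathcal{Q}_i, \mathcal{L}_i)_i$ be an $\Longrightarrow_{AV}$-derivation. It is easy to see that it refines
the $\Longrightarrow_{L}$-derivation $(J_i, \lfloor \mathcal{A}_i \cup \mathcal{P}_i \cup \mathcal{Q}_i \rfloor, \lfloor \mathcal{L}_i \rfloor)_i$ and that the saturation invariant
$TSInf(\mathcal{A}_i) \subseteq TSRed_I(\mathcal{A}_i \cup \mathcal{P}_i \cup \mathcal{Q}_i \cup \|\mathcal{L}_i\|)$ holds if $\mathcal{A}_0 = \emptyset$.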

In contrast with nonsplitting provers, for AV, fairness w.r.t. formulas does
not imply fairness w.r.t. inferences. A problematic scenario involves two premises
$C$, $D$ of an inference $\iota$ and four transitions repeated forever, possibly with other
steps interleaved: INFER makes $C$ active; SWITCH disables it; INFER makes $D$
active; SWITCH disables it. Even though $C$ and $D$ are selected in a strongly fair
fashion, $\iota$ is never performed. We need an even stronger fairness criterion.


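Definition 19. An $\Longrightarrow_{AV}$-derivation $(J_i, \mathcal{A}_i, \mathcal{P}_i, \mathcal{Q}_i, \mathcal{L}_i)_i$ is fair if (1) $\bot \in \bigcup_i \lfloor \mathcal{Q}_i \rfloor$
or (2) $J_i \models (\mathcal{Q}_i)_\bot$ for infinitely many indices $i$ and there exists a subsequence $(J'_j)_j$
converging to a limit point $J$, such that (3) $\liminf_{j \to \infty} TSInf(\mathcal{A}'_j, \mathcal{P}'_j) = \emptyset$ and
(4) $(\limsup_{j \to \infty} \|\mathcal{L}'_j\|)_J \setminus \|\mathcal{L}'^\infty\|_J \subseteq \bigcup_i FRed_F((\mathcal{A}_i \cup \mathcal{P}_i \cup \mathcal{Q}_i \cup \|\mathcal{L}_i\|)_J)$.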


Condition (3) ensures that all inferences involving passive A-formulas are
redundant at the limit point. It would not suffice to require $\mathcal{P}'_\infty = \emptyset$ because
A-formulas can move back and forth between $\mathcal{A}$, $\mathcal{P}$, and $\mathcal{Q}$, as we just saw.
Condition (4) is similar to the condition on locks in Definition 18. If the
$\Longrightarrow_{AV}$-derivation is fair, the corresponding $\Longrightarrow_{L}$-derivation is also fair.

Many selection strategies are combinations of basic strategies, such as choosing 
the smallest formula by weight or the oldest by age. We capture such strategies 
using selection orders $\prec$. Intuitively, $C \prec D$ if the prover will always select $C$
before $D$ if both are present. We use two selection orders: $\prec_{TAF}$, based on
timestamps, must be followed infinitely often; $\prec_{F}$ must be followed otherwise.
For the first one, we can use $\prec_{age}$ defined so that $(C, t) \prec_{age} (C', t')$ if $t < t'$.


Definition 20. Let $X$ be a set. A selection order $\prec$ on $X$ is an irreflexive and
transitive relation such that $\{y \mid y \prec x\}$ is finite for all $x \in X$.


The intersection of two orders $\prec_1$ and $\prec_2$ corresponds to the nondeterministic
alternation between them. The prover may choose either a $\prec_1$-minimal or a
$\prec_2$-minimal A-formula, at its discretion.
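The interplay of the two selection orders can be illustrated with a toy passive set. This is a sketch under assumptions, not the heuristic of any particular prover: formulas are hypothetical triples (weight, timestamp, name), one new light formula arrives at every step, and `age_every` is a made-up parameter stating how often the age-based order is followed.

```python
def pick(passive, step, age_every):
    """Remove and return the next given formula from the passive list."""
    by_age = age_every is not None and step % age_every == age_every - 1
    if by_age:
        key = lambda f: f[1]              # oldest timestamp first
    else:
        key = lambda f: (f[0], f[1])      # smallest weight first, ties broken by age
    chosen = min(passive, key=key)
    passive.remove(chosen)
    return chosen

def simulate(steps, age_every):
    """Select `steps` formulas while a fresh light formula is added at each step."""
    passive = [(100, 0, "heavy")]         # one old, heavy formula
    picked = []
    for step in range(steps):
        passive.append((1, step + 1, f"light{step}"))
        picked.append(pick(passive, step, age_every)[2])
    return picked
```

With `age_every=3` the heavy formula is selected at the third step, whereas with `age_every=None` (weight only) it is starved for as long as light formulas keep arriving, which is the kind of unfairness the age-based order is meant to exclude.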

To ensure completeness, we must restrict the inferences that the prover may 
perform; otherwise, it could derive infinitely many A-formulas with different 
assertions, causing it to switch between two branches of the split tree without 
making progress. Given $N \subseteq \mathbf{AF}$, let $[N] = \{A \mid C \leftarrow A \in N$ for some $C\}$.


Definition 21. A function $F : P(\mathbf{AF}) \to P(\mathbf{AF})$ is strongly finitary if $\lfloor F(N) \rfloor$
and $\bigcup [F(N)] \setminus \bigcup [N]$ are finite for any $N \subseteq \mathbf{AF}$ such that $\lfloor N \rfloor$ is finite.


Intuitively, a strongly finitary function F returns finitely many base formulas 
and finitely many new assertions, although it may return infinitely many
A-formulas. Clearly, $F(N)$ is finite for any finite $N \subseteq \mathbf{AF}$. If $FInf(N)$ is finite for any
finite $N \subseteq \mathbf{F}$, then performing SInf-inferences is strongly finitary. Deterministic
SPLIT rules, such as AVATAR’s, are also strongly finitary. We can lift a strongly
finitary $F$ to any $N \subseteq \mathbf{TAF}$ by taking $F_{TAF}(N) = F(\lfloor N \rfloor) \times \mathbb{N}$. If $F$ and $G$ are
strongly finitary, then so is $N \mapsto F(N) \cup G(N)$.

Simplification rules used by the prover must be restricted even more to ensure 
completeness, because they can lead to new splits and assertions. For example, 
simplifying $p(x * 0) \lor p(x)$ to $p(0) \lor p(x)$ transforms an unsplittable clause into
a splittable one. If simplifications were to produce infinitely many such clauses, 
the prover might split and switch models forever without making progress. 


Definition 22. Let $\prec$ be a well-founded relation on $\mathbf{F}$, and let $\preceq$ be its reflexive
closure. A function $S : \mathbf{AF} \to P(\mathbf{AF})$ is a strongly finitary simplification bound
for $\prec$ if $N \mapsto \bigcup_{C \in N} S(C)$ is strongly finitary and $\lfloor C' \rfloor \preceq \lfloor C \rfloor$ for all $C' \in S(C)$.


The prover may simplify an A-formula C to C’ only if C’ € S(C). It may also 
delete C. Strongly finitary simplification bounds are closed under unions, allowing 
the combination of simplification techniques based on $\prec$. For superposition, a
natural choice for $\prec$ is the clause order. The key property of strongly finitary
simplification bounds is that if we saturate a finite set of A-formulas w.r.t. 
simplifications, the saturation is also finite. 


Example 23. Let $\mathbf{F}$ be the set of first-order clauses and $S(C \leftarrow A) = \{C' \leftarrow A' \mid$
$C'$ is a subclause of $C$ and $A' \subseteq A\}$. Then $S$ is a strongly finitary simplification
bound. This S covers many simplification techniques, including elimination of 
duplicate literals, deletion of resolved literals, and subsumption resolution. 
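For a single A-clause, the bound S of Example 23 can be enumerated directly, which makes its finiteness visible. The sketch below is illustrative only and rests on a simplifying assumption: a clause is modeled as a set of literal strings (so duplicate literals are not represented), and assertions as a set of variable names.

```python
from itertools import combinations

def subsets(items):
    """All subsets of a finite collection, as frozensets."""
    items = sorted(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def simplification_bound(clause, assertions):
    """All pairs (C', A') with C' a subclause of C and A' a subset of A."""
    return {(c, a) for c in subsets(clause) for a in subsets(assertions)}
```

For a clause with n distinct literals and m assertions this yields 2^n · 2^m candidates, so any simplification that stays within the bound can be applied only finitely often to a given A-clause.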


Example 24. If the Knuth–Bendix order [12] is used and all weights are positive,
then $S(C \leftarrow A) = \{C' \leftarrow A' \mid C' \preceq C$ and $A' \subseteq A\}$ is a strongly finitary
simplification bound. This can be used to cover demodulation. 

Equipped with the above definitions, we introduce a fairness criterion that is
more concrete and easier to apply than fairness of $\Longrightarrow_{AV}$-derivations. We could
refine AV further and use this criterion to show the completeness of an imperative 
procedure such as Voronkov’s extended Otter loop [22, Fig. 3], thus showing that 
Vampire with AVATAR is complete if locking is sufficiently restricted. 


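Lemma 25. Let $I$ be a strongly finitary function, and let $S$ be a strongly finitary
simplification bound. Then a well-timestamped $\Longrightarrow_{AV}$-derivation
$(J_i, \mathcal{A}_i, \mathcal{P}_i, \mathcal{Q}_i, \mathcal{L}_i)_i$ is fair if all of the following conditions hold:

1. $\prec_{TAF}$ is a selection order on $\bigcup_i \mathcal{P}_i$, and $\prec_{F}$ is a selection order on $\mathbf{F}$;

2. $\mathcal{A}_0 = \mathcal{L}_0 = \emptyset$ and $\mathcal{P}_0 \cup \mathcal{Q}_0$ is finite;

3. for every INFER transition, either $\mathcal{C}$ is $\prec_{TAF}$-minimal in $\mathcal{P}$ or $\lfloor \mathcal{C} \rfloor$ is
$\prec_{F}$-minimal in $\lfloor \mathcal{P} \rfloor$;

4. for every INFER transition, $\mathcal{P}' \cup \mathcal{Q}' \subseteq I_{TAF}(\mathcal{A} \cup \{\mathcal{C}\})$;

5. for every PROCESS transition, $\mathcal{P}' \cup \mathcal{Q}' \subseteq S_{TAF}(\mathcal{A} \cup \mathcal{P} \cup \mathcal{Q} \cup \|\mathcal{L}\|)$;

6. if $J_i \not\models (\mathcal{Q}_i)_\bot$, then eventually SWITCH or STRONGUNSAT occurs;

7. if $\mathcal{P}_i \neq \emptyset$, then eventually INFER, SWITCH or STRONGUNSAT occurs;

8. there are infinitely many indices $i$ such that either $\mathcal{P}_i = \emptyset$ or INFER chooses
a $\prec_{TAF}$-minimal $\mathcal{C}$ at $i$;

9. $(\limsup_{j \to \infty} \|\mathcal{L}'_j\|)_J \setminus \|\mathcal{L}'^\infty\|_J \subseteq \bigcup_i FRed_F((\mathcal{A}_i \cup \mathcal{P}_i \cup \mathcal{Q}_i \cup \|\mathcal{L}_i\|)_J)$ for every
subsequence converging to a limit point.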


7 Application to Other Architectures 


AVATAR may be the most natural application of our framework, but it is not 
the only one. Below we complete the picture by studying splitting without 
backtracking, labeled splitting, and SMT with quantifiers. 


Splitting without Backtracking. Before AVATAR, Riazanov and Voronkov 
[20] had already experimented with splitting in Vampire in a lighter variant 
without backtracking. They based their work on ordered resolution O with 
selection [2]. Weidenbach [24, end of Sect. 4.5] independently outlined the same 
technique. The basic idea is to extend the signature $\Sigma$ with a countable set $P$
of nullary predicate symbols and to augment the base calculus with a binary
splitting rule that replaces a $\Sigma_P$-clause $C \lor D$ with two $\Sigma_P$-clauses $C \lor p$ and $D \lor \lnot p$.
Riazanov and Voronkov require that the precedence $\prec$ makes all $P$-literals smaller
than the $\Sigma$-literals. Binary splitting is then a simplification. They also extend
the selection function of the base calculus to support $P$-literals. Their parallel
selection function imitates as much as possible the original selection function.
The calculus $O_P$ is closely related to an instance of our framework. Let $\mathbf{F}$ be
the set of $\Sigma$-clauses, with the empty clause as $\bot$. Let O = (FInf, FRed) be the
base calculus. We take V = P. Let LA = (SInf, SRed), whose name stands for 
lightweight AVATAR, be the induced splitting calculus. Lightweight AVATAR 
amounts to the splitting architecture Cruanes implemented in Zipperposition [7,
Sect. 2.5]. Binary splitting can be realized in LA as a SPLIT-like simplification
rule. The calculi $O_P$ and LA disagree slightly because $O_P$’s order $\prec$ can break ties
using $P$-literals and because LA can detect unsatisfiability early using the UNSAT
rule. Despite its slightly weaker order, LA is tighter than $O_P$ in the sense that
saturation w.r.t. $O_P$ implies saturation w.r.t. LA but not vice versa.


Labeled Splitting. Labeled splitting, as originally described by Fietzke and 
Weidenbach [10] and implemented in SPASS, is a first-order resolution-based 
calculus with binary splitting that traverses the split tree in a depth-first way, 
using an elaborate backtracking mechanism inspired by CDCL [15]. It works on 
states $(\Psi, N)$, where $\Psi$ is a stack storing the current state of the split tree and $N$
is a set of labeled clauses, that is, clauses annotated with finite sets of natural numbers.

We model labeled splitting as an instance of the locking prover L based 
on the splitting calculus LS = (SInf, SRed) induced by the resolution calculus 
R = (FInf, FRed), where $\models$ and $\mathrel{|\!\approx}$ are as in Example 2 and $\mathbf{V} = \bigcup_{i \in \mathbb{N}} \{l_i, r_i, s_i\}$.
A-clauses correspond to labeled clauses. Splits are identified by unique split levels.
Given a split on $C \lor D$ with level $k$, $l_k \in asn(C)$ and $r_k \in asn(D)$ represent the
left and right branches. In practice, the prover would dynamically extend fml
to ensure that $fml(l_k) = C$ and $fml(r_k) = D$.

When splitting, if we simply added $\bot \leftarrow \{\lnot l_k, \lnot r_k\}$, we would always need to
consider either $C \leftarrow \{l_k\}$ or $D \leftarrow \{r_k\}$, depending on the interpretation. However,
labeled splitting can undo splits when backtracking. Yet fairness would require us
to perform inferences with either $C$ or $D$ even when labeled splitting would not.
We solve this as follows. Let $\top = \lnot\bot$. We introduce the variable $s_k \in asn(\top)$ so
that we can enable or disable the split. The STRONGUNSAT rule then knows that
$s_k$ is true, but we can still switch to propositional models that disable both $C$
and $D$. A-clauses are then split using the following binary variant of SPLIT:


$$\frac{C \lor D \leftarrow A}{\bot \leftarrow \{\lnot l_k, \lnot r_k, s_k\} \qquad C \leftarrow A \cup \{l_k\} \qquad D \leftarrow A \cup \{r_k\}}\ \text{SOFTSPLIT}$$

where $C$ and $D$ share no variables and $k$ is the next split level. Unlike AVATAR,
labeled splitting keeps the premise and might split it again with another level.

To emulate the original, the locking prover based on LS must repeatedly apply
the following three steps in any order until saturation:


1. Apply BASE to perform an inference from the enabled A-clauses. If an enabled
$\bot \leftarrow A$ is derived with $A \subseteq \bigcup_i \{l_i, r_i\}$, apply SWITCH or STRONGUNSAT.

2. Apply DERIVE to simplify or delete an enabled A-clause. Use LOCK if
necessary to remove the original A-clause. If an enabled $\bot \leftarrow A$ is derived
with $A \subseteq \bigcup_i \{l_i, r_i\}$, apply SWITCH or STRONGUNSAT.

3. Apply SOFTSPLIT with split level $k$ on an A-clause $C$. Then use SWITCH to
enable the left branch and apply LOCK on $C$ with $s_k$ as the lock.


SWITCH is powerful enough to support all of Fietzke and Weidenbach’s back- 
tracking rules, but to explore the tree in the same order as they do, we must 
choose the new model carefully. If a left branch is closed, the model must be 
updated so as to disable the splits that were not used to close this branch and 
to enable the right branch. If a right branch is closed, the split must be disabled, 
and the model must switch to the right branch of the closest enabled split above
it with an enabled left branch. If a right branch is closed but there is no split
above with an enabled left branch, the entire tree has been visited. Then, a
propositional clause $\bot \leftarrow A$ with $A \subseteq \bigcup_i \{s_i\}$ is $\mathrel{|\!\approx}$-entailed by the A-clause set,
and STRONGUNSAT can finish the refutation by exploiting $fml(s_i) = \top$.

The above strategy helps achieve fairness, because it ensures that there exists 
exactly one limit point. It also uses locks in a well-behaved way. This means we 
can considerably simplify the notion of fairness for $\Longrightarrow_{L}$-derivations and obtain a
criterion that is almost identical to, but slightly more liberal than, Fietzke and
Weidenbach’s, thereby re-proving the completeness of labeled splitting.

For terminating derivations, their fairness criterion coincides with ours. For 
diverging derivations, Fietzke and Weidenbach construct a limit subsequence
$(\Psi'_i, N'_i)_i$ of the derivation $(\Psi_i, N_i)_i$ and require that every persistent inference in
it be made redundant, exactly as we do for $\Longrightarrow_{L}$-derivations. The subsequence
consists of all states that lie on the split tree’s unique infinite branch. Locks are
well behaved, with $\limsup_{j \to \infty} \|\mathcal{L}'_j\| = \|\mathcal{L}'^\infty\|$, because with the strategy above,
once an A-clause is enabled on the rightmost branch, it remains enabled forever. 
Our definition of fairness allows more subsequences, although this is difficult to 
exploit without bringing in all the theoretical complexity of AVATAR. 


SMT with Quantifiers. Satisfiability modulo theories (SMT) solvers based on 
DPLL(T) [15] combine a SAT solver with theory solvers. In the classical setup, 
the theories are decidable, and the SMT solver is a decision procedure for the 
union of the theories. Some SMT solvers also support quantified formulas via 
instantiation at the expense of decidability. 

Complete instantiation strategies have been developed for various fragments of 
first-order logic [11,18,19]. In particular, enumerative quantifier instantiation [18] 
is complete under some conditions. An SMT solver following such a strategy ought 
to be refutationally complete, but this has never been proved. Although SMT is 
quite different from the architectures considered above, we can instantiate our 
framework to show the completeness of an abstract SMT solver. The model-guided 
prover MG will provide a suitable starting point. 

Let $\mathbf{F}$ be the set of first-order $\Sigma$-formulas. We represent the SMT solver’s
underlying SAT solver by the UNSAT rule and complement it with an inference
system FInf that includes rules for clausification outside quantifiers, theory
reasoning, and instantiation. The clausification rules derive $C$ and $D$ from a premise
$C \land D$, among others; the theory rules derive $\bot$ from some $\Sigma$-formula set $N$ such
that $N \models \{\bot\}$, ignoring quantifiers; and the instantiation rules derive $\varphi(u)$ from
premises $\forall x.\ \varphi(x)$, where $u$ is a ground term. For FRed, we take an arbitrary
instance of standard redundancy. Its only purpose is to split disjunctions
destructively. We define the “theories with quantifiers” calculus TQ = (FInf, FRed). For
$\models$ and $\mathrel{|\!\approx}$, we use entailment in the supported theories including quantifiers.

We use the same approximation function as in AVATAR (Example 3). Let us 
call $C \leftarrow A$ a subunit if $C$ is not a disjunction. Whenever a (ground) disjunction
$C \lor D \leftarrow A$ emerges, we immediately apply SPLIT. This delegates clausal reasoning
to the SAT solver. It then suffices to assume that TQ is complete for subunits. 

Theorem 26 (Dynamic completeness). Assume TQ is statically complete
for subunit sets. Let $(J_i, N_i)_i$ be a fair $\Longrightarrow_{MG}$-derivation based on TQ. If $N_0 \models \{\bot\}$
and $N_\infty$ contains only subunits, then $\bot \in N_j$ for some $j$.


Like AVATAR-based provers, SMT solvers will typically not perform all
SInf-inferences, not even up to $SRed_I$. Given $a \approx b \leftarrow \{v_0\}$, $b \approx c \leftarrow \{v_1\}$, $a \approx d \leftarrow \{v_2\}$,
$c \approx d \leftarrow \{v_3\}$, and $a \not\approx c \leftarrow \{v_4\}$, an SMT solver will find only one of the conflicts
$\bot \leftarrow \{v_0, v_1, v_4\}$ or $\bot \leftarrow \{v_2, v_3, v_4\}$ but not both. For decidable theories, a practical
fair strategy is to instantiate quantifiers only if no other rules are applicable.
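Why a solver reports only one of the two conflicts can be seen in a small sketch of equality explanation. It is a hypothetical illustration, not the algorithm of any specific SMT solver: a breadth-first search over the labeled equalities stops at the first path connecting the two sides of the disequality, and the labels v0 to v4 are assumed to be attached to the five literals in the order in which they are listed.

```python
from collections import deque

def explain_equality(equalities, start, goal):
    """Return the labels along the first path found from start to goal, or None."""
    graph = {}
    for left, right, label in equalities:
        graph.setdefault(left, []).append((right, label))
        graph.setdefault(right, []).append((left, label))
    queue = deque([(start, frozenset())])
    seen = {start}
    while queue:
        node, labels = queue.popleft()
        if node == goal:
            return labels
        for neighbor, label in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, labels | {label}))
    return None

EQUALITIES = [("a", "b", "v0"), ("b", "c", "v1"), ("a", "d", "v2"), ("c", "d", "v3")]

def conflict():
    """Assertions of one conflict clause for the disequality between a and c (label v4)."""
    return explain_equality(EQUALITIES, "a", "c") | {"v4"}
```

With this edge order the search reaches c through b first, so only the conflict with v0, v1, and v4 is produced; the alternative through d is never generated.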

Our mathematization of AVATAR and SMT with quantifiers exposes their 
dissimilarities. With SMT, splitting is mandatory, and there is no subsumption or 
simplification, locking, or active and passive sets. And of course, theory inferences 
are n-ary and quantifier instantiation is unary, whereas superposition is binary. 
Nevertheless, their completeness follows from the same principles. 


8 Conclusion 


Our framework captures splitting calculi and provers in a general way, indepen- 
dently of the base calculus. Users can conveniently derive a dynamic refutational 
completeness result for a splitting prover based on a given statically refutation- 
ally complete calculus. As we developed the framework, we faced some tension 
between constraining the SAT solver’s behavior and the saturation prover’s. 
It seemed preferable to constrain the prover, because the prover is typically 
easier to modify than an off-the-shelf SAT solver. To our surprise, we discovered 
counterexamples related to locking, formula selection, and simplification, which 
may affect Vampire’s AVATAR implementation, depending on the SAT solver 
used. We proposed some restrictions, but alternatives could be investigated. 

We found that labeled splitting can be seen as a variant of AVATAR where 
the SAT solver follows a strict strategy and propositional variables are not 
reused across branches. A benefit of the strict strategy is that locking preserves 
completeness. As for the relationship between AVATAR and SMT, there are some 
glaring differences, including that splitting is necessary to support disjunctions in 
SMT but fully optional in AVATAR. For future work, we could try to complete 
the picture by considering other related architectures [4–6, 14].

Abstract. Integers are ubiquitous in programming and therefore also in
applications of program analysis and verification. Such applications often 
require some sort of inductive reasoning. In this paper we analyze the 
challenge of automating inductive reasoning with integers. We introduce 
inference rules for integer induction within the saturation framework of 
first-order theorem proving. We implemented these rules in the theorem 
prover VAMPIRE and evaluated our work against other state-of-the-art 
theorem provers. Our results demonstrate the strength of our approach 
by solving new problems coming from program analysis and mathemat- 
ical properties of integers. 


1 Introduction 


One of the most commonly used data types in imperative/functional programs 
are integers. For example, iterating over arrays in imperative programs or recur- 
sively computing sums in functional programs include integer-valued program 
variables, as illustrated in Figure 1. While for many uses of integers in programming
we only need to consider non-negative integers, there are also applications
where integers are essential, for example, reasoning about memory. To formally 
prove functional correctness of such and similar programs, reasoning about in- 
tegers is indispensable but so is handling some sort of induction over integers. 
In this paper we address these two reasoning challenges and fully automate in- 
ductive reasoning with integers within saturation-based theorem proving. 
Induction in saturation-based theorem proving is a new exciting direction 
in the automation of induction, recently introduced in [5] {10} fie). This work 
focused on induction on inductively defined data types, also called algebraic 
data types [12], such as natural numbers or lists. However, automating integer
induction, that is, induction on integers, has not yet been addressed sufficiently. 
While natural numbers have a well-founded order and induction over this 
order is very useful in automated inductive theorem proving, the standard order 
on integers is not well-founded, so it cannot be directly used as the induction 
ordering. In this paper we will use the observation that the standard ordering 
< is well-founded on every set of integers having a lower bound b and likewise, 
the inverse > of this ordering is well-founded on every set of integers having an 
upper bound b. This gives us two induction rules on such integer subsets:
induction (with the base case b) using < and induction (with the base case b) using >,
respectively, to prove that a property holds for all integers $\geq b$ and, respectively,
$\leq b$. We define these induction rules as upward, respectively, downward
induction rules with symbolic bounds. We also consider two variations of these rules
over integer intervals and refer to such rules as interval upward, respectively, 
downward induction rules with symbolic bounds. 

For natural numbers, 0 is an obvious base case candidate, which also turns 
out to be successful in the theorem proving practice. It is also a natural base case 
candidate for induction. In this paper we will give some natural problems for 
which neither 0 nor any concrete integer is a good base case. Our paper focuses 
on the following three issues: 


1. proofs of properties of integers by induction on bounded sets of integers in 
saturation theorem proving, using (interval) downward/upward induction 
rules with symbolic bounds; 

2. techniques for discovering a suitable base case; 

3. implementation techniques. 


This paper is organized as follows. In Section 2 we illustrate our approach
by considering properties of the functional and imperative programs of Figure 1.
Then in Section 3 we define four induction rules over integers, called (interval)
downward, respectively upward, induction rules with symbolic bounds, and prove
their soundness. Section 4 introduces an extension of superposition calculus by
our new integer induction rules. We demonstrate that, using this extension, su- 
perposition provers can prove integer properties similarly to how humans would 
do. This extension is especially successful when used together with the AVATAR 
architecture [9], since AVATAR helps in reasoning efficiently using constraints 
coming out of the integer induction rules. 

We implemented our work in the VAMPIRE theorem prover and compare 
our implementation with other relevant provers, including VAMPIRE without
integer induction (Section 5). Our experiments show that integer induction can
solve many new problems that could not so far be solved by any prover. For ex- 
ample, 75 problems coming from program analysis and/or mathematical integer 
properties could be solved only by VAMPIRE with the new induction rules. 


Contributions. This paper makes the following contributions: 


• We introduce four new inference rules for automating integer induction:
(interval) downward, respectively upward, induction rules with symbolic bounds
(Section 3).

• Based on these rules, we introduce corresponding inference rules for integer
induction in the superposition calculus (Section 4). These rules are formulated
in the context of saturation-based theorem proving in a way that avoids
an immediate combinatorial explosion of the search space.

• We implement and evaluate the new rules in the theorem prover VAMPIRE.
Our experimental results show that our implementation can solve a number
of problems previously unsolved by any prover (Section 5).

• We introduce a large collection of new inductive benchmarks, publicly
available at https://github.com/vprover/inductive_benchmarks

(a) Sum of integers from [n, m]:

    fun sum(n, m) =
      if n = m then n
      else n + sum(n + 1, m);
    assert ∀n, m ∈ Z.(n ≤ m → 2 · sum(n, m) = m · (m + 1) − n · (n − 1))

(b) Array initialization, with val_A(j) denoting A[j]:

    assume 0 ≤ pos < A.size
    i := pos;
    while i + 1 < A.size do
      A[i + 1] := A[i];
      i := i + 1;
      inv ∀j ∈ Z.(pos ≤ j < i → val_A(j + 1) = val_A(j))
    end
    assert ∀j ∈ Z.(pos ≤ j < A.size → val_A(j) = val_A(pos))

Fig. 1. Motivating examples for inductive reasoning with integers.


2 Motivating Examples 


2.1 Preliminaries 


We assume familiarity with standard many-sorted first-order logic with equal- 
ity. For details we refer to [13]. Throughout this paper we denote variables by
$x, y, z, j, n, m$, constants by $c, c'$, Skolem constants by $\sigma$, all possibly with indices.
We denote terms by $t$, literals by $L$, formulas by $F$ and clauses by $C$. We denote
the equality predicate by $=$ and write $t_1 \neq t_2$ for the literal $\lnot(t_1 = t_2)$.

We will focus on integer induction. To this end, we assume a distinguished 
integer sort, denoted by $\mathbb{Z}$. When we use standard integer predicates $<$, $\leq$, $>$,
$\geq$, functions $+, -, \ldots$ and constants $0, 1, 2, \ldots$, we assume that they denote the
corresponding interpreted integer predicates and functions with their standard
interpretations. All other symbols are uninterpreted. We will write quantifiers
like $\forall x \in \mathbb{Z}$ to denote that $x$ has the integer sort.

In what follows, we will sometimes write “this problem requires integer in- 
duction”. This should not be regarded as a formal statement: this property is 
not easy to formalize in general and it is possible that some of these problems 
can be proved by certain combinations of decision procedures, first-order the- 
orem proving with uninterpreted functions, and axiomatization of interpreted 
functions on integers. However, when we make such statements, one can see that 
these problems have relatively simple proofs involving induction and cannot be 
proved by existing provers without induction. 


2.2 Examples 


To illustrate problems arising in automating integer induction, let us consider
the programs of Figure 1. Properties of both programs are specified using
assertions expressed in first-order logic, with pre- and post-conditions specified by
the keywords assume and assert, respectively.

Functional programs. The ML-style functional program of Figure 1(a) computes
the sum $\mathit{sum}(n, m)$ of integers in the interval $[n, m]$, that is $\sum_{i=n}^{m} i$, where $m \ge n$.
The function definition uses the following axioms of sum:


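$$\forall n \in \mathbb{Z}.\,(\mathit{sum}(n, n) = n); \tag{1}$$
$$\forall n, m \in \mathbb{Z}.\,(n < m \to \mathit{sum}(n, m) = n + \mathit{sum}(n + 1, m)). \tag{2}$$

We should prove the assertion
$$\forall n, m \in \mathbb{Z}.\,(n \le m \to 2 \cdot \mathit{sum}(n, m) = m \cdot (m + 1) - n \cdot (n - 1)). \tag{3}$$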
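As a quick sanity check of assertion (3), the following minimal sketch unfolds the recursive definition of sum and tests the closed form on a small range of integers. The function names `sum_rec` and `assertion_3_holds` and the tested range are illustrative assumptions; such testing only supports the claim on finitely many inputs and does not replace the inductive proof.

```python
def sum_rec(n: int, m: int) -> int:
    """sum(n, m) for n <= m, unfolding sum(n, m) = n + sum(n + 1, m) until n = m."""
    if n > m:
        raise ValueError("sum(n, m) is only specified for n <= m")
    total = 0
    while n < m:          # recursive case: n < m
        total += n
        n += 1
    return total + n      # base case: sum(n, n) = n


def assertion_3_holds(n: int, m: int) -> bool:
    """Closed form from the assertion: 2 * sum(n, m) = m(m + 1) - n(n - 1)."""
    return 2 * sum_rec(n, m) == m * (m + 1) - n * (n - 1)


# Collect counterexamples over a small window of integer pairs with n <= m.
violations = [(n, m) for n in range(-20, 21) for m in range(n, 21)
              if not assertion_3_holds(n, m)]
```

An empty `violations` list on this window is consistent with (3), including for negative values of n and m.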
Formally proving (3) requires inductive reasoning with both integers and
quantifiers. Let $F[x]$ be a formula with one or more occurrences of an integer variable
$x$ and $b$ an integer term not containing $x$. Consider the following formula:


$$F[b] \land \forall x \in \mathbb{Z}.\,(x \le b \land F[x] \to F[x - 1]) \to \forall x \in \mathbb{Z}.\,(x \le b \to F[x]). \tag{4}$$


This formula is valid. It is similar to the standard induction on natural numbers,
yet with two essential differences. First, we use $x - 1$ instead of $x + 1$ and second,
we use the term $b$ where for the standard induction we would use 0. Note that $b$
does not have to be a concrete integer; it can be any term. In the sequel we will
refer to such terms $b$ used in induction rules as symbolic bounds.

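For proving (3) using a theorem prover, we first negate and skolemize (3),
obtaining the following formula, where $\sigma_n, \sigma_m$ are fresh Skolem constants:


$$\sigma_n \le \sigma_m \land 2 \cdot \mathit{sum}(\sigma_n, \sigma_m) \ne \sigma_m \cdot (\sigma_m + 1) - \sigma_n \cdot (\sigma_n - 1). \tag{5}$$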


Modern theorem provers implementing linear integer arithmetic and quantifiers
can prove unsatisfiability of (1), (2) and (5) in a relatively straightforward way
if we also add an instance of induction rule (4) with


$$F[x] \overset{\text{def}}{=} 2 \cdot \mathit{sum}(x, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - x \cdot (x - 1);$$
$$b \overset{\text{def}}{=} \sigma_m.$$

Here and in the sequel $\overset{\text{def}}{=}$ means “equal by definition” or “defined as”. If we
want to automate this kind of reasoning, the main question is finding the
corresponding instance of induction rule (4), that is, finding the induction formula
$F[x]$ and the (symbolic) bound $b$.
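To make the role of the induction formula and the symbolic bound concrete, the sketch below instantiates the downward schema with the formula F[x] stating 2 · sum(x, σ_m) = σ_m · (σ_m + 1) − x · (x − 1) and the bound b taken to be σ_m, and checks the base case and the step case for several concrete values of σ_m. The helper names and the finite search depth are assumptions made for illustration; a theorem prover would establish both cases symbolically.

```python
def sum_rec(n: int, m: int) -> int:
    """Value of sum(n, m) for n <= m, i.e. n + (n + 1) + ... + m."""
    if n > m:
        raise ValueError("sum(n, m) is only specified for n <= m")
    return sum(range(n, m + 1))


def F(x: int, sigma_m: int) -> bool:
    """Induction formula F[x] for a concrete value of the Skolem constant sigma_m."""
    return 2 * sum_rec(x, sigma_m) == sigma_m * (sigma_m + 1) - x * (x - 1)


def base_case(sigma_m: int) -> bool:
    """F[b] with the symbolic bound b instantiated to sigma_m."""
    return F(sigma_m, sigma_m)


def step_case(sigma_m: int, depth: int = 30) -> bool:
    """x <= b and F[x] imply F[x - 1], checked for x in [b - depth, b]."""
    return all((not F(x, sigma_m)) or F(x - 1, sigma_m)
               for x in range(sigma_m - depth, sigma_m + 1))


checked = all(base_case(b) and step_case(b) for b in range(-10, 11))
```

Both premises of the schema hold on the tested window, so its conclusion gives F[x] for every x up to the bound, which is exactly what contradicts the negated conjecture.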


Imperative programs. The C-style imperative program of Figure 1(b) initializes
an integer-valued array A starting at the index pos. We should prove the
assertion stating that all array elements at indices greater than or equal to pos are
equal to each other. Proving such assertions typically requires loop invariants
“summarizing” the loop behavior. One such invariant J is shown in the loop
after the keyword inv. This invariant J could be derived by existing approaches
to invariant generation [8, 9].

The assertion of Figure 1(b) is then proved using J, by establishing that the
post-condition

$$\forall j \in \mathbb{Z}.\,(pos \le j < A.size \to val_A(j) = val_A(pos)) \tag{6}$$

is a logical consequence of the invariant J and the negation of the loop condition:

$$\forall j \in \mathbb{Z}.\,(pos \le j < i \to val_A(j + 1) = val_A(j)); \qquad \neg(i + 1 < A.size). \tag{7}$$

Interestingly, modern theorem provers cannot perform such proofs. Similar to
the first example, we can use an induction rule for integers formulated as follows:


$$\big(F[b_1] \land \forall x \in \mathbb{Z}.\,(b_1 \le x < b_2 \land F[x] \to F[x + 1])\big) \to \forall x \in \mathbb{Z}.\,(b_1 \le x \le b_2 \to F[x]). \tag{8}$$


If we add an instance of this rule defined as follows:

$$F[x] \overset{\text{def}}{=} val_A(x) = val_A(pos);$$
$$b_1 \overset{\text{def}}{=} pos;$$
$$b_2 \overset{\text{def}}{=} A.size - 1,$$


then state-of-the-art theorem provers can easily prove that (6) is a logical
consequence of (7) and the corresponding instance of (8). For example, Cvc4,
Z3 [6] and VAMPIRE prove such an instance in essentially no time. However,
similarly to the example of Figure 1(a), in order to find such proofs automatically
using the induction rule of (8), we need to be able to discover, during the proof
search, the induction formula $F[x]$ and the symbolic bounds $b_1, b_2$. In what
follows, we describe our solution to automating this discovery by integrating integer
induction within saturation-based theorem proving.
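The array example can be replayed concretely. The sketch below is an illustrative Python transcription of the program of Figure 1(b), assuming the pre-condition that pos is a valid index; it checks the loop invariant after every iteration and the post-condition at the end. The names `initialize_from` and `postcondition` are hypothetical, and running the loop on particular arrays only illustrates the property that the induction rule proves for all arrays.

```python
def initialize_from(array, pos):
    """Copy array[pos] into every later cell, checking the loop invariant as we go."""
    a = list(array)
    assert 0 <= pos < len(a), "pre-condition: pos must be a valid index"
    i = pos
    while i + 1 < len(a):
        a[i + 1] = a[i]
        i += 1
        # invariant: for all j with pos <= j < i, a[j + 1] == a[j]
        assert all(a[j + 1] == a[j] for j in range(pos, i))
    return a


def postcondition(a, pos):
    """For all j with pos <= j < len(a): a[j] == a[pos]."""
    return all(a[j] == a[pos] for j in range(pos, len(a)))


result = initialize_from([5, 1, 2, 3], 1)
```

Here the interval bounds of the induction are pos and the last index of the array, mirroring the symbolic bounds used in the instance of the interval upward schema.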


3 Integer Induction 


In this section we define four induction rules, or induction schemas, on integers.
Two of them were already considered in Section 2, namely (4) and (8).


Definition 1 (Downward/Upward Induction). A downward, respectively
upward, induction axiom with symbolic bounds is any formula of the form


$$F[b] \land \forall x.(x \le b \land F[x] \to F[x - 1]) \to \forall x.(x \le b \to F[x]); \quad \text{(downward)}$$
$$F[b] \land \forall x.(x \ge b \land F[x] \to F[x + 1]) \to \forall x.(x \ge b \to F[x]), \quad \text{(upward)}$$


respectively, where $F[x]$ is a formula with one or more occurrences of an integer
variable $x$ and $b$ is an integer term not containing $x$.


Note that (4) is a downward induction axiom with symbolic bounds.

Definition 2 (Interval Downward/Upward Induction). An interval
downward, respectively upward, induction axiom with symbolic bounds is any formula
of the form


$$F[b_2] \land \forall x.(b_1 < x \le b_2 \land F[x] \to F[x - 1]) \to \forall x.(b_1 \le x \le b_2 \to F[x]); \quad \text{(down.)}$$
$$F[b_1] \land \forall x.(b_1 \le x < b_2 \land F[x] \to F[x + 1]) \to \forall x.(b_1 \le x \le b_2 \to F[x]), \quad \text{(up.)}$$


respectively, where $F[x]$ is a formula with one or more occurrences of an integer
variable $x$ and $b_1, b_2$ are integer terms not containing $x$.


Note that (8) is an interval upward induction axiom with symbolic bounds.
The main motivation for interval induction rules is their utility in reasoning
about loops, as illustrated by the example of Figure 1(b). While interval
induction can be captured by induction with one bound, it would require additional
case analysis, which is not efficient in saturation-based proving practice.

In the sequel, we will refer to the integer terms $b, b_1, b_2$ from Definitions 1 and 2
as symbolic bounds and the formulas $F[x]$ from the induction axioms of
Definitions 1 and 2 as induction formulas.


Definition 3 (Downward/Upward Induction Rules). The downward (re- 
spectively, upward) induction rule with symbolic bounds, or simply downward 
(respectively, upward) induction rule is the inference rule whose instances are all 
downward (respectively, upward) induction axioms with symbolic bounds. 
Likewise, the interval downward (respectively, upward) induction rule with 
symbolic bounds, or simply interval downward (respectively, upward) induction 
rule is the inference rule whose instances are all interval downward (respectively, 
upward) induction axioms with symbolic bounds. 


It is easy to see that the following theorem holds. 


Theorem 1 (Soundness). The (interval) downward/upward induction rules
of Definition 3 are sound, that is, all corresponding induction axioms from
Definitions 1 and 2 are valid.
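One way to see why the soundness theorem holds is to reduce each integer schema to ordinary induction on natural numbers. The following sketch does this for the downward axiom with one bound; the auxiliary predicate $G$ is introduced here only for illustration and does not appear in the original text.

```latex
% Assume F[b] and \forall x.(x \le b \land F[x] \to F[x-1]).
% Define G(k) := F[b - k] for natural numbers k.
\begin{align*}
G(0) &\equiv F[b]
  && \text{(base case, assumed)}\\
G(k) \to G(k+1) &\equiv F[b-k] \to F[(b-k)-1]
  && \text{(step case with } x := b-k \le b\text{)}\\
&\forall k \in \mathbb{N}.\, G(k)
  && \text{(induction on natural numbers)}\\
&\forall x.(x \le b \to F[x])
  && \text{(each } x \le b \text{ equals } b-k \text{ for } k := b-x \ge 0\text{)}
\end{align*}
```

The upward case is symmetric with $G(k) := F[b + k]$. For the interval axioms the same argument can be applied with $k$ restricted to $0 \le k \le b_2 - b_1$, and the axiom holds trivially when $b_1 > b_2$.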


4 Integer Induction in Saturation-Based Proof Search 


Our next aim is to define analogues of the induction rules introduced in
Section 3 that can be used in superposition theorem provers and their saturation
algorithms. For a general discussion of superposition and saturation we refer
to [13]. In this section we use $\square$ to denote the empty clause and write $\mathrm{CNF}(F)$
to mean (any) clausal normal form of a formula $F$. We refer to the set of clauses
on which a saturation algorithm operates as the search space.

The most general way to introduce our new induction rules at the calculus
level is to add clausal forms of our new induction axioms to the search space.
That is, for every induction axiom $F$ from Section 3 we add the rule


$$\frac{\;}{\mathrm{CNF}(F)}$$

However, we cannot efficiently implement such a calculus, as any formula with
one variable can be used as an induction formula. We will therefore introduce
different, more specialized, rules, which still correspond to the previously defined
induction rules. The new rules use variations of the following three ideas:


1. Use only simple induction formulas, for example literals;

2. To find an induction formula, generalize a subgoal occurring in the search
space. Then the derived induction formula can be immediately used to prove
this subgoal;

3. Use (symbolic) bounds that correspond to bounds already occurring in the
search space.


The first two ideas were already used in the first papers underlying our approach
to induction in saturation theorem proving [10, 16]. For example, they can be
implemented by using only induction formulas that are obtained from ground
literals $L[t]$ in the search space, where $t$ is a ground term. The corresponding
induction formula will be $\neg L[x]$. The idea is that, when we prove the induction
formula, $\neg L[x]$ will be resolved against $L[t]$.

The third idea is new. Note that, if we use the first two ideas and the upward
induction rule, instead of $\neg L[x]$ we will derive $b \le x \to \neg L[x]$. When we resolve
this against $L[t]$, we obtain the clause $\neg(b \le t)$. However, if we already previously
derived $b \le t$, we can also resolve away $\neg(b \le t)$. This gives us the idea to only
apply the upward induction rules when we have $b \le t$.⁴

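Based on the three ideas above, we introduce the following four induction
rules on clauses. In these rules $t$ is a ground term, $b$ is a constant and $L[x]$ is a
literal containing at least one occurrence of a variable $x$ and no other variables.
The rules depend on which comparisons among $t > b$, $t \ge b$, $t < b$ and $t \le b$
already occur in the current search space:


$$\frac{\neg L[t] \lor C \qquad t > b}{\mathrm{CNF}\big((L[b] \land \forall x.(x \ge b \land L[x] \to L[x + 1])) \to \forall y.(y > b \to L[y])\big)} \quad (\mathrm{IntInd}_{>})$$


$$\frac{\neg L[t] \lor C \qquad t \ge b}{\mathrm{CNF}\big((L[b] \land \forall x.(x \ge b \land L[x] \to L[x + 1])) \to \forall y.(y \ge b \to L[y])\big)} \quad (\mathrm{IntInd}_{\ge})$$


$$\frac{\neg L[t] \lor C \qquad t < b}{\mathrm{CNF}\big((L[b] \land \forall x.(x \le b \land L[x] \to L[x - 1])) \to \forall y.(y < b \to L[y])\big)} \quad (\mathrm{IntInd}_{<})$$


$$\frac{\neg L[t] \lor C \qquad t \le b}{\mathrm{CNF}\big((L[b] \land \forall x.(x \le b \land L[x] \to L[x - 1])) \to \forall y.(y \le b \to L[y])\big)} \quad (\mathrm{IntInd}_{\le})$$


Note that $\mathrm{IntInd}_{>}$ and $\mathrm{IntInd}_{\ge}$ are upward induction rules, whereas
$\mathrm{IntInd}_{<}$ and $\mathrm{IntInd}_{\le}$ are downward induction rules. One can also introduce
non-ground analogues of these rules but we do not consider them in this paper.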
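The shape of the four rules can be summarized in a few lines of code. The sketch below builds, as plain strings, the induction axiom whose CNF a rule would add, given a literal with a hole and a comparison found in the search space. The ASCII syntax (`forall`, `&`, `->`), the names `RULES`, `int_ind_axiom` and `applicable_axioms`, and the representation of the search space as a set of triples are all assumptions made for illustration; this is not how VAMPIRE represents clauses.

```python
# For each comparison between t and b: (comparison used in the step case,
# arithmetic operator of the step, comparison used in the conclusion).
RULES = {
    ">":  (">=", "+", ">"),
    ">=": (">=", "+", ">="),
    "<":  ("<=", "-", "<"),
    "<=": ("<=", "-", "<="),
}


def int_ind_axiom(L, b, rule):
    """Induction axiom for literal schema L (a function from a term to a string)."""
    step_cmp, op, concl_cmp = RULES[rule]
    successor = f"x {op} 1"
    step = f"forall x.(x {step_cmp} {b} & {L('x')} -> {L(successor)})"
    conclusion = f"forall y.(y {concl_cmp} {b} -> {L('y')})"
    return f"({L(b)} & {step}) -> {conclusion}"


def applicable_axioms(L, t, comparisons):
    """Axioms triggered for the ground term t by comparisons (t, cmp, b) in the search space."""
    return [int_ind_axiom(L, b, cmp)
            for (term, cmp, b) in sorted(comparisons)
            if term == t and cmp in RULES]


def literal(term):
    return f"p({term})"


axioms = applicable_axioms(literal, "sn", {("sn", "<=", "sm"), ("c", ">", "d")})
```

Only the comparison that mentions the induction term contributes an axiom, which reflects the third idea: bounds are taken from the search space rather than guessed.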


4 Using the AVATAR architecture [19], we can easily obtain valid literals $b \le t$.

Similarly to the above rules on the clausal level, we also introduce the interval
upward/downward induction rules on clauses to be used in saturation algorithms
for the superposition calculus. Since these rules are similar to each other, here
we only define one rule $\mathrm{IntInd}_{[\ge]}$ for interval upward induction. For a ground
term $t$, constants $b_1, b_2$, and $L[x]$ a literal containing at least one occurrence of a
variable $x$ and no other variables, the interval upward induction rule on clauses is:


$$\frac{\neg L[t] \lor C \qquad t \ge b_1 \qquad t \le b_2}{\mathrm{CNF}\big((L[b_1] \land \forall x.(b_1 \le x < b_2 \land L[x] \to L[x + 1])) \to \forall y.(b_1 \le y \le b_2 \to L[y])\big)} \quad (\mathrm{IntInd}_{[\ge]})$$


In view of Theorem 1, all induction rules of Section 3 are sound. Assuming
that our CNF function preserves satisfiability, we conclude that all our induction
rules $\mathrm{IntInd}_{>}$, $\mathrm{IntInd}_{\ge}$, $\mathrm{IntInd}_{<}$, $\mathrm{IntInd}_{\le}$ and $\mathrm{IntInd}_{[\ge]}$ on the clausal level
are sound.


Theorem 2 (Soundness). For every satisfiability preserving CNF function, 
the induction rules from Definition [5] are sound. 


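Example 1. To illustrate again how the choice of induction formulas allows us
to have shorter clauses, consider $\mathrm{IntInd}_{<}$. The CNF in its conclusion consists
of three clauses:

$$\begin{aligned}
&\neg L[b] \lor \sigma \le b \lor \neg(y < b) \lor L[y]\\
&\neg L[b] \lor L[\sigma] \lor \neg(y < b) \lor L[y]\\
&\neg L[b] \lor \neg L[\sigma - 1] \lor \neg(y < b) \lor L[y]
\end{aligned} \tag{9}$$


These clauses can be resolved against the premises of $\mathrm{IntInd}_{<}$, yielding the
following clauses:

$$\begin{aligned}
&\neg L[b] \lor \sigma \le b \lor C\\
&\neg L[b] \lor L[\sigma] \lor C\\
&\neg L[b] \lor \neg L[\sigma - 1] \lor C
\end{aligned} \tag{10}$$


They have an especially simple form when $C$ is the empty clause $\square$. In this case
we have three clauses:

$$\begin{aligned}
&\neg L[b] \lor \sigma \le b\\
&\neg L[b] \lor L[\sigma]\\
&\neg L[b] \lor \neg L[\sigma - 1]
\end{aligned} \tag{11}$$

which subsume the original three longer clauses and are ground. Since they are
ground, they can be handled efficiently by AVATAR.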
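The claim that the conclusion of the downward rule clausifies into exactly three clauses can be checked propositionally. In the sketch below each atom of the Skolemized axiom (such as L[b], σ ≤ b, L[σ], L[σ − 1], y < b and L[y]) is treated as an independent Boolean, and the axiom is compared with the conjunction of the three clauses over all 64 valuations. The function names are illustrative, and the check covers only the propositional structure after Skolemization, not the first-order reasoning itself.

```python
from itertools import product


def axiom(l_b, s_le_b, l_s, l_s1, y_lt_b, l_y):
    """(L[b] and forall x.(x <= b and L[x] -> L[x-1])) -> (y < b -> L[y]), Skolemized.

    Negating the step case yields a witness s with s <= b, L[s] and not L[s - 1].
    """
    step_fails_at_s = s_le_b and l_s and not l_s1
    return (not l_b) or step_fails_at_s or (not y_lt_b) or l_y


def three_clauses(l_b, s_le_b, l_s, l_s1, y_lt_b, l_y):
    """Conjunction of the three clauses obtained by distributing the disjunction."""
    c1 = (not l_b) or s_le_b or (not y_lt_b) or l_y
    c2 = (not l_b) or l_s or (not y_lt_b) or l_y
    c3 = (not l_b) or (not l_s1) or (not y_lt_b) or l_y
    return c1 and c2 and c3


equivalent = all(axiom(*v) == three_clauses(*v)
                 for v in product([False, True], repeat=6))
```

The three clauses share every literal except the one coming from the failed step case, which is why resolving against the premises leaves such short ground clauses.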


Example 2. Let us now demonstrate how the downward induction rule $\mathrm{IntInd}_{\le}$
works for refuting the inductive property from our motivating example of
Figure 1(a). We use literals from (5) as the premises of the $\mathrm{IntInd}_{\le}$ rule. The
corresponding instance of the downward induction rule is defined by


$$b \overset{\text{def}}{=} \sigma_m;$$
$$t \overset{\text{def}}{=} \sigma_n;$$
$$L[x] \overset{\text{def}}{=} 2 \cdot \mathit{sum}(x, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - x \cdot (x - 1).$$
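
This instance of $\mathrm{IntInd}_{\le}$ is:


$$\frac{2 \cdot \mathit{sum}(\sigma_n, \sigma_m) \ne \sigma_m \cdot (\sigma_m + 1) - \sigma_n \cdot (\sigma_n - 1) \qquad \sigma_n \le \sigma_m}
{\mathrm{CNF}\left(\begin{aligned}
&\big(2 \cdot \mathit{sum}(\sigma_m, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - \sigma_m \cdot (\sigma_m - 1)\\
&\;\;\land \forall x.(x \le \sigma_m \land 2 \cdot \mathit{sum}(x, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - x \cdot (x - 1)\\
&\qquad\quad \to 2 \cdot \mathit{sum}(x - 1, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - (x - 1) \cdot ((x - 1) - 1))\big)\\
&\to \forall y.(y \le \sigma_m \to 2 \cdot \mathit{sum}(y, \sigma_m) = \sigma_m \cdot (\sigma_m + 1) - y \cdot (y - 1))
\end{aligned}\right)} \quad (\mathrm{IntInd}_{\le})$$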


This single instance of the induction rule does the magic. By adding its
conclusion to the search space we can obtain a contradiction in a few steps by
applying a few superposition rules and using ground reasoning in linear integer
arithmetic with uninterpreted functions (as evidenced by the results for the first
problem subset, x_all of sum, in Table 3).
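The few remaining steps can be spelled out by hand. The derivation below is a sketch of the arithmetic behind the base case and the step case of this instance, assuming that the recursive axiom of sum applies whenever its first argument is strictly below its second; it is meant as a reading aid and is not a trace of the prover.

```latex
% Base case, using sum(n, n) = n:
\begin{align*}
2 \cdot \mathit{sum}(\sigma_m, \sigma_m) &= 2\sigma_m
  = (\sigma_m^2 + \sigma_m) - (\sigma_m^2 - \sigma_m)
  = \sigma_m(\sigma_m + 1) - \sigma_m(\sigma_m - 1).
\end{align*}
% Step case: for x <= sigma_m we have x - 1 < sigma_m, hence
% sum(x - 1, sigma_m) = (x - 1) + sum(x, sigma_m). With the induction hypothesis:
\begin{align*}
2 \cdot \mathit{sum}(x - 1, \sigma_m)
  &= 2(x - 1) + 2 \cdot \mathit{sum}(x, \sigma_m)\\
  &= 2(x - 1) + \sigma_m(\sigma_m + 1) - x(x - 1)\\
  &= \sigma_m(\sigma_m + 1) - (x - 1)(x - 2)\\
  &= \sigma_m(\sigma_m + 1) - (x - 1)((x - 1) - 1).
\end{align*}
% Conclusion: instantiating y := sigma_n, with sigma_n <= sigma_m from (5),
% contradicts the disequality in (5).
```

Each line is a ground or linear-arithmetic step once $x$ is replaced by the Skolem constant introduced by clausification, which matches the observation that only ground reasoning and a few superposition inferences are needed.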

We finally note that functional correctness of Figure 1(b) is proved by the
interval upward induction rule $\mathrm{IntInd}_{[\ge]}$, in a similar way as above (and as
evidenced by the results of Table 3 for declared_unint_ax-fin_conj-fin in val).


What we find especially interesting in Example 2 is that the induction axiom
used in it (and discovered by our implementation of induction in VAMPIRE) uses
the induction argument that would probably be used by a majority of humans
who would try to argue why the program property holds.


5 Implementation and Experiments 


5.1 Implementation 


We implemented our integer induction rules $\mathrm{IntInd}_{>}$, $\mathrm{IntInd}_{\ge}$, $\mathrm{IntInd}_{<}$,
$\mathrm{IntInd}_{\le}$ as well as $\mathrm{IntInd}_{[\ge]}$ and the other corresponding interval induction
rules in VAMPIRE. Further, we also implemented a more general induction
rule IntInd that does not require bounds to be in the search space
and uses 0 as the lower or the upper bound. Our implementation in VAM- 
PIRE, consisting of approximately 1,200 lines of new C++ code, is available at 
The size of this additional code is rel- 
atively small because VAMPIRE has libraries for indexing and chaining inference 
rules that could be used off the shelf. 

Our (interval) downward/upward induction rules described in Section 4 can
be applied when either (i) the comparison literal (e.g., $t > b$ for the $\mathrm{IntInd}_{>}$
rule) is selected and the corresponding clause $\neg L[t] \lor C$ was already selected as
an induction candidate before, or (ii) $\neg L[t] \lor C$ is selected as an induction
candidate and the corresponding comparison literal was already selected before.
To implement these rules efficiently, we should be able to efficiently retrieve
comparison literals and literals selected for induction. To do so, we extended
the indexing mechanism of VAMPIRE to index such literals. We do not apply
induction when the induction formula $L[x]$ is a comparison having $x$ as a
top-level argument, for example, $x < t$, and allow it to be applied to all other induction
formulas deemed to be suitable by other user-specified options.
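The two triggering conditions (i) and (ii) amount to a symmetric lookup between two indices. The sketch below illustrates the idea with two dictionaries keyed by the ground term t; the class `InductionIndex`, its method names and the string encoding of clauses are hypothetical and much simpler than the term indexing actually used in VAMPIRE.

```python
from collections import defaultdict


class InductionIndex:
    """Toy index pairing induction candidates with comparison literals on the same term."""

    def __init__(self):
        self.candidates = defaultdict(list)   # term t -> clauses of the form ~L[t] v C
        self.comparisons = defaultdict(list)  # term t -> (comparison, bound b) pairs

    def select_candidate(self, t, clause):
        """Case (ii): a candidate is selected; pair it with comparisons seen before."""
        self.candidates[t].append(clause)
        return [(clause, cmp, b) for (cmp, b) in self.comparisons[t]]

    def select_comparison(self, t, cmp, b):
        """Case (i): a comparison is selected; pair it with candidates seen before."""
        self.comparisons[t].append((cmp, b))
        return [(clause, cmp, b) for clause in self.candidates[t]]


index = InductionIndex()
first = index.select_candidate("sn", "~L[sn]")      # no comparison known yet
second = index.select_comparison("sn", "<=", "sm")  # triggers induction on ~L[sn]
third = index.select_candidate("sn", "~M[sn]")      # triggers with the stored comparison
```

Each returned triple corresponds to one rule application, so an induction axiom is generated exactly once per pair regardless of which premise is selected first.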

assume $e \ge 1$
fun power(x, 1) = x
  | power(x, e) = x · power(x, e − 1);
assert $\forall x, y \in \mathbb{Z}.\,(\mathit{power}(x \cdot y, e) = \mathit{power}(x, e) \cdot \mathit{power}(y, e))$


Fig. 2. ML-like functional program computing integer powers for positive exponents.


Our (interval) downward/upward induction rules in VAMPIRE are enabled 
by the new option --induction int. The options --int_induction_interval 
infinite and --int_induction_interval finite limit the enabled rules to 
downward/upward only, and interval downward/upward only, respectively. Fur- 
ther, --int_induction_default_bound on enables the more general rule which 
does not require bounds to be in the search space. Our new induction rules 
can also be controlled by other VAMPIRE options for well-founded/structural 
induction, such as --induction_on_complex_terms on, which enables apply- 
ing induction on any ground complex term. To improve VAMPIRE’s per- 
formance for integer induction, we combined our new induction rules with 
--induction_on_complex_terms on and also other options not specific for
induction. We extended VAMPIRE with a new mode scheduling various
option configurations for integer induction, switched on by the option --mode
portfolio --schedule integer_induction. Additionally, we introduced the
option --schedule induction which uses either the integer induction configu- 
rations as for --schedule integer_induction, or structural induction config- 
urations, or both, depending on the data types used in the problem/property to 
be proved. 
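For scripting experiments, the options named above can be assembled programmatically. The helper below is a small sketch that only builds an argument list and does not invoke the prover; the function name `vampire_command`, the executable name `vampire`, the example file name and the convention of passing the problem file as the last argument are assumptions, while the option names and values are the ones quoted in the text.

```python
def vampire_command(problem, interval=None, default_bound=False,
                    portfolio=False, executable="vampire"):
    """Build an argument list for one integer-induction configuration (not executed)."""
    cmd = [executable]
    if portfolio:
        # Portfolio mode scheduling configurations for integer induction.
        cmd += ["--mode", "portfolio", "--schedule", "integer_induction"]
    else:
        cmd += ["--induction", "int"]
        if interval is not None:
            if interval not in ("infinite", "finite"):
                raise ValueError("interval must be 'infinite' or 'finite'")
            cmd += ["--int_induction_interval", interval]
        if default_bound:
            cmd += ["--int_induction_default_bound", "on"]
    cmd.append(problem)  # assumed: problem file given as the final argument
    return cmd


single = vampire_command("problem.smt2", interval="finite")
scheduled = vampire_command("problem.smt2", portfolio=True)
```

The resulting lists can be passed to a process launcher of choice; further options, such as the input syntax, would have to be added according to the prover's own documentation.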


5.2 Benchmarks 


We used two sets of examples: (i) benchmark sets LIA and UFLIA from the
SMT-LIB collection [2], consisting of, respectively, 607 and 10,137 examples,
and (ii) 120 new benchmarks similar to our motivating examples from Section 2.

To the best of our knowledge, the state-of-the-art systems implementing
inductive reasoning have so far not yet considered inductive reasoning over integers,
with two exceptions: [17], which mainly focuses on induction over inductively
defined data types but mentions induction on non-negative integers, and [11],
which supports inductive reasoning using recursive function definitions without
any special treatment for integers.

Set | Variant tag | Description
sum | x / y | sum(x, y) for x ≤ y defined as x + sum(x + 1, y) or y + sum(x, y − 1)
sum | all / geq / leq | the conjecture holds for all x, y where x ≤ y, or only for x ≤ y = c, or only for c = x ≤ y; where c ∈ Z is an interpreted constant
val | declared / defined | val was either not defined, only declared and axiomatized (as in (6)), or defined as a total computable function (as in (14))
val | inter / unint / mixed | the axiom and conjecture use concrete interpreted constants, or uninterpreted constants, or a mix of both
val | ax-fin / ax-all / ax-leq / ax-geq | the axiom holds for integers in an interval [c, c'], or for all x ∈ Z, or only for x ≤ c, or only for x ≥ c; where c, c' ∈ Z are constants
val | conj-fin / conj-all / conj-leq / conj-geq | the conjecture holds for integers in an interval [c, c'], or for all integers, or only for integers ≤ c, or only for integers ≥ c; where c, c' ∈ Z are constants
power | 0 / 1 | power defined starting with power(x, 0) = 1 or power(x, 1) = x
power | all / pos / neg | the conjecture holds either for all x, y, or only for x, y > 0, or only for x, y < 0

Table 1. Description of our benchmark set of 120 new examples.


Since integer induction has not yet attracted enough attention in theorem
proving, there is no significant collection of benchmarks for integer induction. To
properly carry out experiments, we therefore created a set of 120 new benchmarks
based on variations of our motivating examples from Section 2 and on properties
of computing integer powers. One example is the functional correctness of the
program of Figure 2, which is formalized as follows:


axioms:
$$\forall x \in \mathbb{Z}.\,(\mathit{power}(x, 1) = x)$$
$$\forall x, e \in \mathbb{Z}.\,(2 \le e \to \mathit{power}(x, e) = x \cdot \mathit{power}(x, e - 1)) \tag{12}$$
conjecture:
$$\forall x, y, e.\,(1 \le e \to \mathit{power}(x \cdot y, e) = \mathit{power}(x, e) \cdot \mathit{power}(y, e))$$
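The power benchmark can be exercised on concrete values before attempting a proof. The sketch below follows the two axioms, with the base case at exponent 1 and the recursive case for exponents of at least 2, and tests the conjecture on a small grid of integers. The function name and the tested ranges are illustrative assumptions, and the check is only evidence on finitely many inputs.

```python
def power(x: int, e: int) -> int:
    """power(x, 1) = x and power(x, e) = x * power(x, e - 1) for e >= 2."""
    if e < 1:
        raise ValueError("power is only specified for exponents e >= 1")
    result = x                # power(x, 1) = x
    for _ in range(e - 1):    # unfold the recursive axiom e - 1 times
        result *= x
    return result


# Conjecture: power(x * y, e) = power(x, e) * power(y, e) for all e >= 1.
conjecture_holds = all(power(x * y, e) == power(x, e) * power(y, e)
                       for x in range(-5, 6)
                       for y in range(-5, 6)
                       for e in range(1, 6))
```

A proof of the conjecture would use upward induction on the exponent with the interpreted constant 1 as the bound, together with commutativity and associativity of multiplication.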


Our set of 120 new benchmarks is described in Table 1 and available online at:


https://github.com/vprover/inductive_benchmarks


To confirm that our new benchmarks require the use of inductive reasoning,
we tested them on the SMT solver Z3 [6], which does not support induction.
Z3 could not solve any of the 120 problems from our benchmark set. Names of
subsets of our new benchmarks are constructed by joining variant tags described
in Table 1. For example, problem (6) belongs to the category
declared_unint_ax-fin_conj-fin of the set val. The following benchmark:


axiom:
$$\forall x \in \mathbb{Z}.\,(val(x) = val(x + 1))$$
conjecture:
$$\forall x, y \in \mathbb{Z}.\,(val(x) = val(y)) \tag{13}$$


belongs to declared_unint_ax-all_conj-all of val, and the example below is from
defined_inter_ax-geq_conj-geq of val:


axioms:
$$\forall x \in \mathbb{Z}.\,(x \le 0 \to val(x) = 0)$$
$$\forall x \in \mathbb{Z}.\,(0 < x \to val(x) = val(x - 1)) \tag{14}$$
conjecture:
$$\forall x \in \mathbb{Z}.\,(0 \le x \to val(x) = val(0))$$


While 9 of the benchmarks (all in val) use finite intervals in both the
assertion and the invariant (ax-fin_conj-fin), the remaining 111 benchmarks require
inductive reasoning over infinite intervals.

Problem set | Total count | Cvc4 | Z3 | VAMPIRE | VAMPIRE-I | new compared to VAMPIRE | new compared to VAMPIRE, Cvc4 and Z3
LIA | 607 | 553 | 435 | 216 | 214 | 10 | 1
UFLIA | 10137 | 7002 | 6705 | 6116 | 5796 | 99 | 44


Table 2. Comparison of solvers on SMT-LIB benchmarks.


5.3 Experimental Setup 


We ran our experiments on computers with 32 cores (AMD Epyc 7502, 2.5 
GHz) and 1 TB RAM. In all experiments we used the memory limit of 16 GB 
per problem. For the new benchmarks we used a 300 seconds time limit. For the 
experiments on the larger LIA and UFLIA sets we used a 10 seconds time limit. 

In what follows, VAMPIRE refers to the (default) version of VAMPIRE, as
in [10, 16]. By VAMPIRE-I we denote our new version of VAMPIRE, using integer
induction rules (--induction int). VAMPIRE-I* refers to the portfolio mode
of VAMPIRE-I, scheduling various option configurations for integer induction
(--mode portfolio --schedule induction).

For experiments with the new benchmarks, we note that VAMPIRE without
integer induction cannot solve any of the problems. In this set of
experiments, we therefore compared VAMPIRE-I to the provers Cvc4
and ACL2 [11], which are, to the best of our knowledge, the only two
automated solvers supporting inductive reasoning with integers in
addition to reasoning with theories and quantifiers. For Cvc4, we used
the ig configuration from [17]: --quant-ind --quant-cf --conjecture-gen
--conjecture-gen-per-round=3 --full-saturate-quant. For ACL2, we used
its default configuration. In the experiments with the LIA and UFLIA
benchmark sets of SMT-LIB, we also used Z3 [6] in the default configuration.

We ran Cvc4, Z3, VAMPIRE and VAMPIRE-I on problems encoded in the
SMT-LIB2 syntax [2]. For running ACL2 on the new benchmarks, we translated
problems into the functional program encoding syntax of ACL2.


5.4 Experimental Results 


Problem set | Problem subset | Count | ACL2 | Cvc4 | VAMPIRE-I*
sum | x_all
sum | y_all
sum | x_leq
sum | y_geq
sum | subset total
val | declared_mixed_ax-fin_conj-fin
val | declared_unint_ax-fin_conj-fin
val | declared_inter_ax-all_conj-all
val | declared_inter_ax-all_conj-geq
val | declared_inter_ax-all_conj-leq
val | declared_inter_ax-geq_conj-geq
val | declared_inter_ax-leq_conj-leq
val | declared_unint_ax-all_*
val | declared_unint_ax-geq_conj-geq
val | declared_unint_ax-leq_conj-leq
val | defined_inter_ax-all_conj-all
val | defined_inter_ax-geq_conj-geq
val | defined_inter_ax-leq_conj-leq
val | defined_unint_*
val | subset total
power | 0_all
power | 0_pos
power | 0_neg
power | 1_all
power | 1_pos
power | 1_neg
power | subset total
all sets | combined total | 120
all sets | uniquely solved


Table 3. Experiments with our new benchmarks from Table 1.


SMT-LIB Benchmarks. First, we evaluated the improvements of integer
induction in VAMPIRE-I when compared to VAMPIRE, Cvc4 and Z3 on the LIA and
UFLIA sets of SMT-LIB [2]. We aimed to verify that VAMPIRE-I’s performance
does not deteriorate due to adding integer induction, to check whether VAMPIRE-I
can solve problems that could not be solved automatically before, and to identify
the best values for options related to integer induction. To this end, we picked
five different strategies (e.g. using different saturation algorithms and selection
functions) and used different combinations of induction options. Table 2
summarizes our results, showcasing that integer induction enabled VAMPIRE-I to
solve over 100 new problems that VAMPIRE could not solve before (last but one
column of Table 2). Moreover, 45 of these problems were also new compared to
Cvc4 and Z3 (last column of Table 2), which most likely means that no theorem
prover was able to prove them before.
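The two totals quoted in this paragraph follow directly from the per-set counts of Table 2. The short computation below makes the arithmetic explicit; it assumes that the last two columns of the table read 10 and 1 for LIA and 99 and 44 for UFLIA, and the dictionary layout is only an illustrative encoding.

```python
# New problems per benchmark set: (new compared to VAMPIRE,
#                                  new compared to VAMPIRE, Cvc4 and Z3)
NEW_PROBLEMS = {
    "LIA": (10, 1),
    "UFLIA": (99, 44),
}

new_vs_vampire = sum(counts[0] for counts in NEW_PROBLEMS.values())
new_vs_all_solvers = sum(counts[1] for counts in NEW_PROBLEMS.values())
```

With these entries the first total is 109, matching "over 100", and the second is 45.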


In problems solved using integer induction, the integer induction rules were 
applied often: at least one of the interval induction rules was used in nearly 
99% of problems, while one of the induction rules with one bound was used in 
nearly all problems. The interval induction and induction rules were used on 
average 4559 and 1191 times, respectively. 89% of the proofs employed interval 
induction (67% upward, 29% downward), while 27% of the proofs used induction 
with one bound (22% upward, 8% downward). Additionally, over 64% of proofs 
only required one application of any induction rule. 

Experiments with 120 New Benchmarks. Comparison results for VAMPIRE-I,
ACL2 and Cvc4 on our new benchmarks are displayed in Table 3, aggregated
by benchmark subsets, as described in Table 1. We do not show VAMPIRE in the
table, since without integer induction it cannot solve any of the problems.

The results show that in some cases ACL2 can perform upward and downward 
induction on integers, but only when using interpreted constants as a base case 
(that is, it cannot handle symbolic bounds). However, it can only do so if it also 
proves termination of the recursively defined function. It also has issues with 
reasoning about multiplication. 

Cvc4 has limited support for integer induction: it can apply upward
induction but only when the base case is an interpreted constant. Since some problems
seem to require induction with symbolic bounds, Cvc4 is mostly able to either
solve all problems in a subset, or none of them. The only exception is the subset
declared_mixed_ax-fin_conj-fin, in which Cvc4 solves one problem, which can be
solved using upward induction with an interpreted constant as the base case.

VAMPIRE-I* does not have any conceptual problems with solving the
benchmarks. However, since it uses axioms and inference rules rather than dedicated
decision procedures for handling integers, it sometimes has issues with solving
problems with large integer values. For example, for the infinite interval subset
of the val benchmark set, the only problems VAMPIRE-I* did not solve were those
containing the interpreted constant 100 or -100. Similarly, in the power
benchmark set, the unsolved problems contained large numbers. Finally, in the
declared_mixed_ax-fin_conj-fin subset, the two problems VAMPIRE-I* did not solve
also required more sophisticated arithmetic reasoning. However, the inability to
deal efficiently with large numbers is not an intrinsic problem of superposition
theorem provers. Reasoning with quantifiers and theories is still in its infancy
and major improvements are underway. For example, there are recent parallel
developments in superposition and linear arithmetic that should improve
this kind of reasoning in VAMPIRE.


6 Related Work 


Previous works on automating induction mainly focused on inductive
reasoning for inductively defined data types, for example in inductive theorem provers
ACL2 [14], IsaPlanner [7], HipSpec, Zeno and Imandra [14]; superposition
theorem provers Zipperposition [5] and VAMPIRE [16]; and the SMT solver
Cvc4. While most of these solvers support reasoning with integers, only
ACL2 and Cvc4 implement some form of induction over integers.

The ACL2 approach generates induction schemas based on recursive
function calls in the property to be proved. Hence, it can only use induction to
prove properties of recursively defined functions. On the other hand, the
SMT-based setting of Cvc4 applies induction by inductive strengthening of
SMT properties in combination with subgoal discovery. As noted in Section 5,
Cvc4 is limited to induction with concrete base cases and upward induction.

While downward integer induction can be considered a straightforward
generalization of upward integer induction and does not solve many more problems in
our benchmark sets, symbolic bounds provide a very powerful generalization, as
witnessed by experimental results. In automated reasoning, the power provided
by more general rules comes with the price of uncontrollable blowup of the search
space. To harness this power we came up with defining (interval) upward/downward
induction rules with symbolic bounds in the superposition calculus in such
a way that they result in most cases in the addition of very simple clauses, which
can be efficiently handled within the AVATAR architecture.

We believe that variants of our induction rules defined in Section 4 can also
be successfully used by SMT solvers. The idea is to apply them, like we do, only
when there is a suitable bound in the current candidate model. One can also
combine this with the observation made in Example 1: one can resolve added
induction formulas against literals already occurring in the search space to add
only ground formulas.

The benchmark suite we propose and use in this paper is new and can be 
used to complement existing benchmarks: the TIP library [3] and the examples
of Ey. Our 120 new examples are however more focused on integer properties, 
whereas those collections contain a variety of problems mostly requiring induction over
inductively defined types. Specifically, out of more than 500 inductive problems in
TIP [3], only 3 use integers and no inductive data types. The other collection
contains 311 inductive benchmarks translated into three encodings, (i) using only
inductive data types, (ii) using integers instead of natural numbers, but also 
other inductive data types (such as lists or trees), and (iii) using both integers 
and natural numbers to express the same properties, alongside other inductive 
data types. Problems from (iii) are also included in SMT-LIB [2]. Note that
there is a substantial difference between our benchmarks and benchmarks from 
(ii). The latter mostly require inductive reasoning only for inductive data types 
(or no induction at all): they contain integers but only a few of them require 
inductive reasoning over integers, while most of our benchmarks require proper 
integer induction. For example, VAMPIRE can solve 131 of 306 benchmarks in 
(ii) without using integer induction. 


7 Conclusions 


We introduced new inference rules for automating inductive reasoning with inte- 
gers within saturation-based theorem proving. Many problems in program analy- 
sis and mathematical problems of integers previously unsolvable by any theorem 
prover can now be solved completely automatically. We believe our results can 
progress automated program analysis and automation of mathematics, where 
integers are universally used. 


Abstract. We present a complete superposition calculus for first-order
logic with an interpreted Boolean type. Our motivation is to lay the foun- 
dation for refutationally complete calculi in more expressive logics with 
Booleans, such as higher-order logic, and to make superposition work ef- 
ficiently on problems that would be obfuscated when using clausification 
as preprocessing. Working directly on formulas, our calculus avoids the 
costly axiomatic encoding of the theory of Booleans into first-order logic 
and offers various ways to interleave clausification with other derivation 
steps. We evaluate our calculus using the Zipperposition theorem prover, 
and observe that, with no tuning of parameters, our approach is on a par 
with the state-of-the-art approach. 


1 Introduction 


Superposition is a calculus for equational first-order logic that works on problems 
given in clausal normal form. Its immense success made preprocessing clausifica- 
tion a predominant mechanism in modern automatic theorem proving. However, 
this preprocessing is not without drawbacks. Clausification can transform sim- 
ple problems, such as s → s where s is a large formula, in a way that hides 
its original simplicity from the superposition calculus. Ganzinger and Stuber’s 
superposition-like calculus [13] operates on clauses that contain formulas as well 
as terms and replaces preprocessing clausification by inprocessing—meaning pro- 
cessing during the operation of the calculus itself. Inprocessing clausification 
allows superposition’s powerful simplification engine to work on formulas. For 
example, unit equalities can rewrite formulas s and t in s ↔ t before clausifi- 
cation duplicates the occurrences into s → t and t → s. Whole formulas rather 
than simple literals can be removed by rules such as subsumption resolution [4]. 

Another issue with Boolean reasoning in the standard superposition calculus 
is that, in first-order logic, formulas cannot appear inside terms although this is 
often desirable for problems coming from software verifiers or proof assistants. 
Instead, authors of such tools need to resort to translations. Kotelnikov et al. 
studied effects of these translations in detail. They showed that simple axioms 
such as the domain cardinality axiom for Booleans (∀(x : o). x ≈ ⊤ ∨ x ≈ ⊥) can 
severely slow down superposition provers. To support more efficient reasoning on 
problems with first-class Booleans, they describe the FOOL logic, which admits 
functions that take arguments of Boolean type and quantification over Booleans. 
They further describe two approaches to reason in FOOL: The first one [17] 
requires an additional rule in the superposition calculus, whereas the second 
one [16] is completely based on preprocessing. 

Our calculus combines complementary advantages of Ganzinger and Stuber’s 
and of Kotelnikov et al.’s work. Following Kotelnikov et al., our logic (Sect. 2) 
is similar to FOOL and supports nesting formulas inside terms, as well as quan- 
tifying over Booleans. Following Ganzinger and Stuber, our calculus (Sect. 3) 
reasons with formulas and supports inprocessing clausification. 

Our calculus also extends the two approaches. To reduce the number of 
possible inferences, we generalize Ganzinger and Stuber’s Boolean selection 
functions, which allow us to restrict the Boolean subterms in a clause on which 
inferences can be performed. The term order requirements of our calculus are 
less restrictive than Ganzinger and Stuber’s. In addition to the lexicographic 
path order (LPO), we also support the Knuth-Bendix order (KBO) [15], which 
is known to work better with superposition in practice. 

Our proof of refutational completeness (Sect. 4) lays the foundation for com- 
plete calculi in more complex logics with Booleans. Indeed, Bentkamp et al. [8] 
devised a refutationally complete calculus for higher-order logic based on our 
completeness theorem. Our theorem incorporates a powerful redundancy crite- 
rion that allows for a variety of inprocessing clausification methods (Sect. 5). 

We implemented our approach in the Zipperposition theorem prover (Sect. 6) 
and evaluated it on thousands of problems that target our logic ranging from 
TPTP to SMT-LIB to Sledgehammer-generated benchmarks (Sect. 7). Without 
fine-tuning, our new calculus performs as well as known techniques. Exploring 
the strategic choices that our calculus opens should lead to further performance 
improvements. In addition, we corroborate the claims of Ganzinger and Stuber 
concerning applicability of formula-based superposition reasoning: We find a set 
of 17 TPTP problems (out of 1000 randomly selected) that Zipperposition can 
solve only using the techniques described in this paper. We refer to our technical 
report [25] for more details on our calculus and the complete completeness proof. 


2 Logic 


Our logic is a first-order logic with an interpreted Boolean type. It is essentially 
identical to the UF logic of SMT-LIB [5], including the Core theory, but without 
if-then-else and let expressions, which can be supported through simple transla- 
tions. It also closely resembles Kotelnikov et al.’s FOOL [17], which additionally 
supports if-then-else and let expressions. 

Our logic requires an interpreted Boolean type o and allows for an arbitrary 
number of uninterpreted types. The set of symbols must contain the logical 
symbols ⊤, ⊥ : o; ¬ : o → o; ∧, ∨, → : (o × o) → o; and the overloaded symbols 
≈, ≉ : (τ × τ) → o for each type τ. The logical symbols are printed in bold 
to distinguish them from the notation used for clauses below. Throughout the 
paper, we write tuples (a1, ..., an) as ān or ā. 

The set of terms is defined inductively as follows. Every variable is a term. If 
f : τ̄n → υ is a symbol and t̄n : τ̄n is a tuple of terms, then the application f(t̄n) 
(or simply f if n = 0) is a term of type υ. If x is a variable and t : o a Boolean 
term, then the quantified terms ∀x. t and ∃x. t are terms of Boolean type. We 
view quantified terms modulo α-renaming. A formula is a term of Boolean type. 

The root of a term is f if the term is an application f(t̄n); it is x if the term 
is a variable x; and it is ∀ or ∃ if the term is a quantified term ∀x. t or ∃x. t. A 
variable occurrence is free in a term if it is not bound by ∀ or ∃. A term is ground 
if it contains no free variables. Substitutions are defined as usual in first-order 
logic and they rename quantified variables to avoid capture. 

A literal s ≈̇ t is an equation s ≈ t or a disequation s ≉ t. Unlike terms 
constructed using the function symbols ≈ and ≉, literals are unoriented. A clause 
L1 ∨ ··· ∨ Ln is a finite multiset of literals Lj. The empty clause is written as 
⊥. Terms t of Boolean type are not literals. They must be encoded as t ≈ ⊤ 
and t ≈ ⊥, which we call predicate literals. Both are considered positive literals 
because they are equations, not disequations. 

We have considered excluding negative literals s ≉ t by encoding them as 
(s ≈ t) ≈ ⊥, following Ganzinger and Stuber. However, this approach requires 
an additional term order condition to make the conclusion of equality factoring 
small enough, excluding KBO. To support both KBO and LPO, we allow neg- 
ative literals. Regardless, our simplification mechanism will allow us to simplify 
negative literals of the form t ≉ ⊥ and t ≉ ⊤ into t ≈ ⊤ and t ≈ ⊥, respectively, 
thereby eliminating redundant representations of predicate literals. 

The semantics is a straightforward extension of standard first-order logic only 
adding the interpretation of the Boolean type as a two element domain, as in 
Kotelnikov et al.’s FOOL logic. Some of our calculus rules introduce Skolem sym- 
bols, which are intended to be interpreted as witnesses for existentially quantified 
terms. Still, our semantics treats them as uninterpreted symbols. To achieve a 
satisfiability-preserving calculus, we assume that these symbols do not occur in 
the input problem. More precisely, we inductively extend the signature of the 
input problem by a symbol sk_{∀ȳ.∃z.t} : τ̄ → υ for each term of the form ∃z. t over 
the extended signature, where υ is the type of z and ȳ : τ̄ are the free variables 
occurring in ∃z. t, in order of first appearance. 


3 The Calculus 


Following standard superposition, our calculus employs a term order and a literal 
selection function to restrict the search space. To accommodate for quantified 
Boolean terms, we impose additional requirements on the term order. To support 
flexible reasoning with Boolean subterms, in addition to the literal selection 
function, we introduce a Boolean subterm selection function. 

Term Order The calculus is parameterized by a strict well-founded order ≻ 
on ground terms that fulfills: (O1) u ≻ ⊥ ≻ ⊤ for any term u that is not ⊤ or 
⊥; (O2) ∀x. t ≻ {x ↦ u}t and ∃x. t ≻ {x ↦ u}t for any term u whose only 
Boolean subterms are ⊤ and ⊥; (O3) subterm property; (O4) compatibility 
with contexts (not necessarily below ∀ and ∃); (O5) totality. The order is 
extended to literals, clauses, and nonground terms as usual [2]. The nonground 
order then also enjoys (O6) stability under grounding substitutions. 

Ganzinger and Stuber’s term order restrictions are similar but incompatible 
with KBO. Using an encoding of our terms into untyped first-order logic we 
describe how both LPO and the transfinite variant of KBO [19] can satisfy 
conditions (O1)–(O6). 

Our encoding represents bound variables by De Bruijn indices, which become 
new constant symbols db_n for n ∈ ℕ. Quantifiers are represented by two new 
unary function symbols, also denoted by ∀ and ∃. All other symbols are simply 
identified with their untyped counterpart. Regardless of symbol precedence or 
symbol weights, KBO and LPO enjoy properties (O3)–(O6) when applied to the 
encoded terms. They are even compatible with contexts below quantifiers. 
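
The encoding described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the tuple-based term representation, the function name encode, and the convention that index 0 refers to the innermost binder are assumptions made for this sketch.

```python
def encode(term, bound=()):
    """Encode a term with binders as a binder-free first-order term.

    Terms are tuples: ("var", name), ("app", symbol, args), and
    ("forall" | "exists", variable_name, body).
    """
    kind = term[0]
    if kind == "var":
        name = term[1]
        if name in bound:
            # Assumption: index 0 refers to the innermost enclosing binder.
            return ("app", f"db_{bound.index(name)}", ())
        return term  # free variables are left untouched
    if kind == "app":
        _, symbol, args = term
        return ("app", symbol, tuple(encode(a, bound) for a in args))
    if kind in ("forall", "exists"):
        _, name, body = term
        # The quantifier becomes a unary function symbol applied to the body.
        return ("app", kind, (encode(body, (name,) + bound),))
    raise ValueError(f"unknown term kind: {kind!r}")


# forall x. exists y. p(x, y)
t1 = ("forall", "x", ("exists", "y",
      ("app", "p", (("var", "x"), ("var", "y")))))
# forall u. exists v. p(u, v), an alpha-renaming of t1
t2 = ("forall", "u", ("exists", "v",
      ("app", "p", (("var", "u"), ("var", "v")))))
```

Because bound variables turn into constants, the two α-equivalent terms t1 and t2 receive the same encoding, which is what allows an order on untyped first-order terms to be reused for quantified terms.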

To satisfy (O1) and (O2), let the precedence for LPO be ⊤ ≺ ⊥ ≺ f ≺ ∀ ≺ 
∃ ≺ db_0 ≺ db_1 ≺ ··· where f is any other symbol. For KBO, we can use the same 
symbol precedence and a symbol weight function W that assigns each symbol 
ordinal weights (of the form ωa + b with a, b ∈ ℕ), where W(⊤) = W(⊥) = 
1, W(∀) = W(∃) = ω, and W(f) ∈ ℕ \ {0} for any other symbol f. 
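
To see why ordinal weights help with (O2), the following Python sketch models a weight ωa + b as the pair (a, b) and compares pairs lexicographically. The assumptions here are that weights are added componentwise (the natural sum), that every symbol other than the quantifiers has weight 1, and that terms are given in the encoded, binder-free form as (symbol, args) pairs; none of this is taken from the authors' code.

```python
# A weight omega*a + b is modelled as the pair (a, b); Python compares
# tuples lexicographically, which matches the order on these ordinals.
OMEGA = (1, 0)


def symbol_weight(symbol):
    if symbol in ("forall", "exists"):
        return OMEGA
    return (0, 1)  # assumption: every other symbol gets weight 1


def weight(term):
    """Sum the symbol weights of an encoded term (symbol, args)."""
    symbol, args = term
    a, b = symbol_weight(symbol)
    for arg in args:
        wa, wb = weight(arg)
        a, b = a + wa, b + wb  # componentwise (natural) sum
    return (a, b)


def tower(n):
    """Build f(f(...f(c)...)) with n applications of f."""
    term = ("c", ())
    for _ in range(n):
        term = ("f", (term,))
    return term


quantified = ("forall", (("p", (("db_0", ()),)),))   # forall x. p(x)
instance = ("p", (tower(100),))                      # p(f^100(c))
```

Here weight(quantified) is (1, 2), i.e. ω + 2, whereas the instance has the finite weight (0, 102). However large the quantifier-free term u substituted for x is, the instance stays below ω and is therefore lighter than the quantified term.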


Selection and Eligibility Following an idea of Ganzinger and Stuber, we 
parameterize our calculus with two selection functions: one selecting literals and 
one selecting Boolean subterms. 


Definition 1 (Selection functions). The calculus is parameterized by a lit- 
eral selection function FLSel and a Boolean subterm selection function FBSel. 
The function FLSel maps each clause to a subset of its literals. The selection 
function FBSel maps each clause to a subset of its Boolean subterms. The 
literals FLSel(C) and the subterms FBSel(C) are selected in C. The following 
restrictions apply: (S1) A literal can only be selected if it is negative or of the 
form s ≈ ⊥. (S2) A Boolean subterm can only be selected if it is not ⊤, ⊥, or 
a variable. (S3) A Boolean subterm can only be selected if its occurrence is not 
below a quantifier. (S4) The topmost terms on either side of a positive literal 
cannot be selected. 


The interplay of maximality w.r.t. term order, literal and Boolean selection 
functions gives rise to a new notion of eligibility: 


Definition 2 (Eligibility). A literal L is (strictly) eligible w.r.t. a substitution 
σ in C if it is selected in C or there are no selected literals and no selected Boolean 
subterms in C and σL is (strictly) maximal in σC. The eligible subterms of a 
clause C w.r.t. a substitution σ are inductively defined as follows: (E1) Any 
selected subterm is eligible. (E2) If a literal s ≈̇ t with σs ⋠ σt is either eligible 
and negative or strictly eligible and positive, then s is eligible. (E3) If a subterm 
is eligible and its root is not ≈, ≉, ∀, or ∃, all of its direct subterms are also 
eligible. (E4) If a subterm is eligible and of the form s ≈ t or s ≉ t, then s is 
eligible if σs ⋠ σt and t is eligible if σs ⋡ σt. The substitution σ is left implicit 
if it is the identity substitution. 


The Core Inference Rules The following inference rules form our calculus: 


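\[
\frac{\overbrace{D' \vee t \approx t'}^{D} \qquad C[u]}{\sigma(D' \vee C[t'])}\;\text{SUP}
\qquad
\frac{\overbrace{C' \vee u' \approx v' \vee u \approx v}^{C}}{\sigma(C' \vee v \not\approx v' \vee u \approx v)}\;\text{FACTOR}
\]
\[
\frac{\overbrace{C' \vee u \not\approx u'}^{C}}{\sigma C'}\;\text{IRREFL}
\qquad
\frac{\overbrace{C' \vee s \approx t}^{C}}{\sigma C'}\;\bot\text{ELIM}
\qquad
\frac{C[u]}{\sigma C[t']}\;\text{BOOLRW}
\]
\[
\frac{C[\forall z.\, v]}{C[\{z \mapsto \mathsf{sk}_{\forall \bar{y}.\, \exists z.\, \neg v}(\bar{y})\}v]}\;\forall\text{RW}
\qquad
\frac{C[\exists z.\, v]}{C[\{z \mapsto \mathsf{sk}_{\forall \bar{y}.\, \exists z.\, v}(\bar{y})\}v]}\;\exists\text{RW}
\]
\[
\frac{C[u]}{C[\bot] \vee u \approx \top}\;\text{BOOLHOIST}
\qquad
\frac{C[s \approx t]}{C[\bot] \vee s \approx t}\;{\approx}\text{HOIST}
\qquad
\frac{C[s \not\approx t]}{C[\top] \vee s \approx t}\;{\not\approx}\text{HOIST}
\]
\[
\frac{C[\forall x.\, t]}{C[\bot] \vee \{x \mapsto y\}t \approx \top}\;\forall\text{HOIST}
\qquad
\frac{C[\exists x.\, t]}{C[\top] \vee \{x \mapsto y\}t \approx \bot}\;\exists\text{HOIST}
\]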


The rules are subject to the following side conditions: 


SUP (1) σ = mgu(t, u); (2) u is not a variable; (3) σt ⋠ σt'; (4) D ≺ C[u]; 
(5) u is eligible in C w.r.t. σ; (6) t ≈ t' is strictly eligible in D w.r.t. σ; 
(7) the root of t is not a logical symbol; (8) if σt' = ⊥, the subterm u is at 
the top level of a positive literal. 

FACTOR (1) σ = mgu(u, u'); (2) σu ≉ t ∉ σC for any term t; (3) no Boolean 
subterm and no literal is selected in C; (4) σu is a maximal term in σC; 
(5) σv is maximal in {t | σu ≈ t ∈ σC}. 

IRREFL (1) σ = mgu(u, u'); (2) u ≉ u' is eligible in C w.r.t. σ. 

⊥ELIM (1) σ = mgu(s ≈ t, ⊥ ≈ ⊤); (2) s ≈ t is strictly eligible in C w.r.t. σ. 

BOOLRW (1) (t, t') is one of the following pairs, where x is a fresh variable: 
(¬⊥, ⊤), (¬⊤, ⊥), (⊥ ∧ ⊥, ⊥), (⊤ ∧ ⊥, ⊥), (⊥ ∧ ⊤, ⊥), (⊤ ∧ ⊤, ⊤), (⊥ ∨ ⊥, ⊥), 
(⊤ ∨ ⊥, ⊤), (⊥ ∨ ⊤, ⊤), (⊤ ∨ ⊤, ⊤), (⊥ → ⊥, ⊤), (⊤ → ⊥, ⊥), (⊥ → ⊤, ⊤), 
(⊤ → ⊤, ⊤), (x ≈ x, ⊤), (x ≉ x, ⊥); (2) σ = mgu(t, u); (3) u is not a 
variable; (4) u is eligible in C w.r.t. σ. 

×RW (where × ∈ {∀, ∃}) (1) v is a term that may refer to z; (2) ȳ are the 
free variables occurring in ∀z. v and ∃z. v, respectively, in order of first 
appearance; (3) the indicated subterm is eligible in C; (4) for ∀RW, C[⊤] is 
not a tautology; (5) for ∃RW, C[⊥] is not a tautology. (In an implementation, 
the tautology check can be approximated by checking if the affected literal 
is of the form ∀z. v ≈ ⊤ or ∃z. v ≈ ⊥.) 

BOOLHOIST (1) u is a Boolean term whose root is an uninterpreted predicate; 
(2) u is eligible in C; (3) u is not a variable; (4) u is not at the top level of 
a positive literal. 

×HOIST (where × ∈ {≈, ≉, ∀, ∃}) (1) the indicated subterm is eligible in C; (2) y 
is a fresh variable. 


Rationale for the Rules Our calculus is a graceful generalization of superpo- 
sition: if the input clauses do not contain any Boolean terms, it coincides with 
standard superposition. In addition to the standard superposition rules SUP, 
FACTOR, and IRREFL, our calculus contains various rules to deal with Booleans. 
For each logical symbol and quantifier, we must consider the case where it is true 
and the case where it is false. Whenever possible, we prefer rules that rewrite 
the Boolean subterm in place (with names ending in RW). When this cannot be 
done in a satisfiability-preserving way, we resort to rules hoisting the Boolean 
subterm into a dedicated literal (with names ending in HOIST). For terms rooted 
by an uninterpreted predicate, the rule BOOLHOIST only deals with the case that 
the term is false. If it is true, we rely on SUP to rewrite it to ⊤ eventually. 


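Example 3. The clause a ∧ ¬a ≈ ⊤ can be refuted by the core inferences as 
follows. First we derive a ≈ ⊤ (displayed on the left) and then we use it to 
derive ⊥ (displayed on the right). In this and the following example, we assume 
eager selection of literals whenever the selection restrictions allow it. Each 
clause follows from the one above it by the indicated rule; both SUP steps use 
a ≈ ⊤ as their second premise. 

\[
\begin{array}{llll}
a \wedge \neg a \approx \top & & a \wedge \neg a \approx \top \qquad a \approx \top & \\
\bot \wedge \neg a \approx \top \vee a \approx \top & \text{BOOLHOIST} & \top \wedge \neg a \approx \top & \text{SUP} \\
\bot \wedge \neg \bot \approx \top \vee a \approx \top \vee a \approx \top & \text{BOOLHOIST} & \top \wedge \neg \top \approx \top & \text{SUP} \\
\bot \wedge \top \approx \top \vee a \approx \top \vee a \approx \top & \text{BOOLRW} & \top \wedge \bot \approx \top & \text{BOOLRW} \\
\bot \approx \top \vee a \approx \top \vee a \approx \top & \text{BOOLRW} & \bot \approx \top & \text{BOOLRW} \\
a \approx \top \vee a \approx \top & \bot\text{ELIM} & \bot & \bot\text{ELIM} \\
\top \not\approx \top \vee a \approx \top & \text{FACTOR} & & \\
a \approx \top & \text{IRREFL} & &
\end{array}
\]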


The derivation illustrates how BOOLHOIST and SUP replace uninterpreted predi- 
cates by ⊤ and ⊥ to allow BOOLRW to eliminate the surrounding logical symbols. 
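
The BOOLRW steps of this example can be replayed with a small Python sketch. It is an illustration under assumptions: terms are nested tuples, the strings "top" and "bot" stand for ⊤ and ⊥, only ground terms are handled (so unification reduces to syntactic equality), and the names BOOLRW, rewrite_step, and normalize are made up for this sketch.

```python
TOP, BOT = "top", "bot"

# The ground pairs (t, t') from side condition (1) of BOOLRW.
BOOLRW = {
    ("not", BOT): TOP, ("not", TOP): BOT,
    ("and", BOT, BOT): BOT, ("and", TOP, BOT): BOT,
    ("and", BOT, TOP): BOT, ("and", TOP, TOP): TOP,
    ("or", BOT, BOT): BOT, ("or", TOP, BOT): TOP,
    ("or", BOT, TOP): TOP, ("or", TOP, TOP): TOP,
    ("imp", BOT, BOT): TOP, ("imp", TOP, BOT): BOT,
    ("imp", BOT, TOP): TOP, ("imp", TOP, TOP): TOP,
}


def rewrite_step(term):
    """Rewrite the leftmost innermost redex; return (term, changed)."""
    if isinstance(term, str):
        return term, False
    head, *args = term
    for i, arg in enumerate(args):
        new_arg, changed = rewrite_step(arg)
        if changed:
            return (head, *args[:i], new_arg, *args[i + 1:]), True
    if term in BOOLRW:
        return BOOLRW[term], True
    # The pairs (x = x, top) and (x != x, bot), restricted to ground terms.
    if head in ("eq", "neq") and args[0] == args[1]:
        return (TOP if head == "eq" else BOT), True
    return term, False


def normalize(term):
    """Apply rewrite_step until no rule applies; return (term, steps)."""
    steps = 0
    while True:
        term, changed = rewrite_step(term)
        if not changed:
            return term, steps
        steps += 1
```

Normalizing ("and", BOT, ("not", BOT)) and ("and", TOP, ("not", TOP)) takes two steps each and ends in BOT, matching the two BOOLRW steps in each column of the example. By contrast, ("and", "a", ("not", "a")) is left unchanged, which is why the uninterpreted predicate a first has to be replaced by ⊤ or ⊥.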


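Example 4. The clause (∃x. ∀y. y ≉ x) ≈ ⊤ can be refuted as follows, where each 
clause follows from the one above it by the indicated rule: 

\[
\begin{array}{ll}
(\exists x.\, \forall y.\, y \not\approx x) \approx \top & \\
(\forall y.\, y \not\approx \mathsf{sk}_{\exists x.\, \forall y.\, y \not\approx x}) \approx \top & \exists\text{RW} \\
\bot \approx \top \vee (y \not\approx \mathsf{sk}_{\exists x.\, \forall y.\, y \not\approx x}) \approx \top & \forall\text{HOIST} \\
(y \not\approx \mathsf{sk}_{\exists x.\, \forall y.\, y \not\approx x}) \approx \top & \bot\text{ELIM} \\
\bot \approx \top & \text{BOOLRW} \\
\bot & \bot\text{ELIM}
\end{array}
\]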

Redundancy Criterion In standard superposition, a clause is defined as re- 
dundant if all of its ground instances follow from smaller ground instances of 
other clauses. We keep this definition, but use a nonstandard notion of ground 
instances, inspired by constraint superposition [23]. In our completeness proof, 
this new notion of ground instances ensures that ground instances of the con- 
clusion of ∀RW, ∃RW, ∀HOIST, and ∃HOIST inferences are smaller than the 
corresponding instances of their premise by property (O2). 


Definition 5 (Redundancy of clauses). The ground instances of a clause C 
are all ground clauses of the form γC where γ is a substitution such that for all 
variables x, the only Boolean subterms of γx are ⊥ and ⊤. A ground clause C 
is redundant w.r.t. a ground clause set N if there exist clauses C1, ..., Ck ∈ N 
such that C1, ..., Ck ⊨ C and C ≻ Ci for all 1 ≤ i ≤ k. A nonground clause C 
is redundant w.r.t. clauses N if C is strictly subsumed by a clause in N or every 
ground instance of C is redundant w.r.t. ground instances of N. 


In standard superposition, an inference is defined as redundant if all its 
ground instances are, and a ground inference is defined as redundant if its con- 
clusion follows from other clauses smaller than the main premise. We keep this 
definition as well, but we use a nonstandard notion of ground instances for some 
of the Boolean rules. In our report, we define a slightly stronger variant of in- 
ference redundancy via an explicit ground calculus, but the following notion is 
also strong enough to justify the few prover optimizations based on inference 
redundancy we know from the literature (e.g., simultaneous superposition [7]). 


Definition 6 (Redundancy of inferences). A ground instance of a ∀RW, 
∃RW, ∀HOIST, or ∃HOIST inference is an inference obtained by applying a 
grounding substitution to premise and conclusion, regardless of whether the 
result is a valid ∀RW, ∃RW, ∀HOIST, or ∃HOIST inference. A ground instance of 
an inference ι of other rules is an inference ι′ of the same rule such that premises 
and conclusion of ι′ are ground instances of the respective premises and conclu- 
sion of ι. For ι′, we use selection functions that select the ground literals and 
Boolean subterms corresponding to the ones selected in the nonground premises. 
A ground inference with main premise C, side premises C1, ..., Cn, and conclu- 
sion D is redundant w.r.t. N if there exist clauses D1, ..., Dk ≺ C in N such 
that D1, ..., Dk, C1, ..., Cn ⊨ D. A nonground inference is redundant if all its 
ground instances are redundant. 


A clause set N is saturated if every inference from N is redundant w.r.t. N. 


Simplification Rules The redundancy criterion is a graceful generalization of 
the criterion of standard superposition. Thus, the standard simplification and 
deletion rules, such as deletion of trivial literals and clauses, subsumption, and 
demodulation, can be justified. Demodulation below quantifiers is justified if the 
term order is compatible with contexts below quantifiers. 

Some calculus rules can act as simplifications. ⊥ELIM can always be a simpli- 
fication. Given a clause on which both ×RW and ×HOIST apply, where × ∈ {∀, ∃}, 
the clause can be replaced by the conclusions of these rules. If ×RW does not 
apply because of condition 4 or 5, ×HOIST alone can be a simplification. Also 
justified by redundancy, the rules BOOLHOIST and ×HOIST can simultaneously 
replace all occurrences of the eligible subterm they act on. For example, applying 
≈HOIST to p(x ≈ y) ≈ ⊤ ∨ q(x ≈ y) ≈ ⊥ yields p(⊥) ≈ ⊤ ∨ q(⊥) ≈ ⊥ ∨ x ≈ y. 
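
The simultaneous replacement in the last example can be sketched as follows. The representation is an assumption made for illustration: a literal is a triple (lhs, rhs, positive), terms are nested tuples, and eq_hoist_all is a hypothetical helper name; eligibility checks are omitted.

```python
TOP, BOT = "top", "bot"


def replace_all(term, target, replacement):
    """Replace every occurrence of target in term by replacement."""
    if term == target:
        return replacement
    if isinstance(term, tuple):
        head, *args = term
        return (head, *(replace_all(a, target, replacement) for a in args))
    return term


def eq_hoist_all(clause, s, t):
    """Hoist all occurrences of the term-level equation s = t at once.

    Every occurrence is replaced by bot and the literal s = t is added.
    A literal is a triple (lhs, rhs, positive).
    """
    target = ("eq", s, t)
    hoisted = [
        (replace_all(lhs, target, BOT), replace_all(rhs, target, BOT), positive)
        for (lhs, rhs, positive) in clause
    ]
    return hoisted + [(s, t, True)]


# p(x = y) = top  or  q(x = y) = bot
clause = [
    (("p", ("eq", "x", "y")), TOP, True),
    (("q", ("eq", "x", "y")), BOT, True),
]
result = eq_hoist_all(clause, "x", "y")
```

The result consists of the literals p(⊥) ≈ ⊤, q(⊥) ≈ ⊥, and x ≈ y, in agreement with the clause given in the text.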

While experimenting with our implementation, we have observed that the 
following simplification rule from Vampire [18] can substantially shorten proofs: 

\[
\frac{s \not\approx t \vee C[s]}{s \not\approx t \vee C[t]}\;\text{LOCALRW}
\]

In this rule, we require s ≻ t. 

Interpreting literals of the form s ≈ ⊤ as s ≉ ⊥ and s ≈ ⊥ as s ≉ ⊤, we 
can apply the rule even to these positive literals. This is especially convenient with 
rules such as BOOLHOIST. Consider the clause C = p^i(⊥) ≈ ⊥ ∨ q ≈ ⊥, assume 
no literal is selected and the Boolean selection function always selects a subterm 
p(⊥). Applying BOOLHOIST to C, we get p(⊥) ≈ ⊤ ∨ p^{i−1}(⊥) ≈ ⊥ ∨ q ≈ ⊥. This 
can then be simplified to a tautological clause p(⊥) ≈ ⊤ ∨ p(⊥) ≈ ⊥ ∨ q ≈ ⊥ 
using i − 2 LOCALRW steps. If we did not use LOCALRW, BOOLHOIST would 
produce i − 2 intermediary clauses starting from C, none of which would be 
recognized as a tautology. 
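
The count of i − 2 steps can be checked with a short simulation. This is a sketch under assumptions: terms are nested tuples, "bot" stands for ⊥, the literal q ≈ ⊥ is ignored because it plays no role, and the function names are invented for this illustration.

```python
BOT = "bot"


def p_power(k):
    """Build p^k(bot), i.e. p applied k times to bot."""
    term = BOT
    for _ in range(k):
        term = ("p", term)
    return term


def rewrite_once(term, lhs, rhs):
    """Replace one occurrence of lhs by rhs; return (term, changed)."""
    if term == lhs:
        return rhs, True
    if isinstance(term, tuple):
        head, *args = term
        for i, arg in enumerate(args):
            new_arg, changed = rewrite_once(arg, lhs, rhs)
            if changed:
                return (head, *args[:i], new_arg, *args[i + 1:]), True
    return term, False


def localrw_steps(i):
    """Count LOCALRW steps after one BOOLHOIST on p^i(bot) = bot (i >= 2)."""
    hoisted = ("p", BOT)      # new literal p(bot) = top, read as p(bot) != bot
    rest = p_power(i - 1)     # remaining literal p^(i-1)(bot) = bot
    steps = 0
    # Once the remaining literal is p(bot) = bot, the clause also contains
    # p(bot) = top and is a tautology.
    while rest != hoisted:
        rest, changed = rewrite_once(rest, hoisted, BOT)
        assert changed
        steps += 1
    return steps
```

For instance, localrw_steps(5) rewrites p^4(⊥) to p^3(⊥), p^2(⊥), and p(⊥), which is 3 = i − 2 steps.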

Many rules of our calculus replace subterms with ⊤ or ⊥. After this replace- 
ment, resulting terms can be simplified using Boolean equivalences that specify 
the behavior of logical operations on ⊤ and ⊥. To this end, we use the rule 
BOOLSIMP [33], similar to simp of Leo-III [27, Sect. 4.2.1]: 

\[
\frac{C[s]}{C[t]}\;\text{BOOLSIMP}
\]

This rule replaces s with t whenever s ≈ t is contained in a predefined 
set of tautological equations. In addition to all equations that Leo-III uses 
for simp, we also include more complex ones, such as (¬u → u) ≈ u and 
(u1 → ··· → un → v1 ∨ ··· ∨ vm) ≈ ⊤ where ui = vj for some i and j. The 
exhaustive list is given in our technical report. Using BOOLSIMP and ⊥ELIM, 
the twelve steps of Example 3 can be replaced by just two simplification steps. 
BOOLSIMP simplifies terms with logical symbol roots if one argument is either 
⊤ or ⊥ or if two arguments are identical. Thus, after simplification, BOOLRW 
applies only in two remaining cases: if all arguments of a logical symbol are 
distinct variables and if the sides of a (dis)equation are different and unifiable. 
This observation can be used to streamline the implementation of BOOLRW. 
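
The two-step refutation of Example 3 can be illustrated with the sketch below. The set of equivalences is a small subset chosen for this illustration (the exhaustive list is in the technical report and is not reproduced here); the tuple representation and the names boolsimp and simplify_clause are assumptions.

```python
TOP, BOT = "top", "bot"


def boolsimp(term):
    """Simplify bottom-up with a few Boolean equivalences (illustrative subset)."""
    if isinstance(term, str):
        return term
    head, *args = term
    args = [boolsimp(a) for a in args]
    if head == "not":
        (a,) = args
        if a in (TOP, BOT):
            return BOT if a == TOP else TOP
        return ("not", a)
    if head in ("and", "or"):
        a, b = args
        # unit is the neutral element, zero the absorbing element
        unit, zero = (TOP, BOT) if head == "and" else (BOT, TOP)
        if zero in (a, b):
            return zero
        if a == unit:
            return b
        if b == unit or a == b:
            return a
        if a == ("not", b) or b == ("not", a):
            return zero  # x and not x = bot; x or not x = top
        return (head, a, b)
    return (head, *args)


def simplify_clause(clause):
    """Apply boolsimp to both sides of each positive equation (lhs, rhs),
    then drop literals bot = top, as BOT-ELIM does when used as a simplification."""
    simplified = [(boolsimp(lhs), boolsimp(rhs)) for lhs, rhs in clause]
    return [lit for lit in simplified if set(lit) != {TOP, BOT}]
```

Calling simplify_clause([(("and", "a", ("not", "a")), TOP)]) first turns the literal into ⊥ ≈ ⊤ and then removes it, leaving the empty clause.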


4 Refutational Completeness 


Our calculus is dynamically refutationally complete. All the rules that do not 
introduce Skolem symbols are also sound. 

Theorem 7 (Completeness). Let S_0 be an unsatisfiable set of clauses. Let 
(S_i)_{i=0}^∞ be a fair derivation, i.e., a derivation where ⋃_{i=0}^∞ ⋂_{j=i}^∞ S_j is 
saturated. Then ⊥ ∈ S_i for some i. 


We outline some key parts of the proof here and refer to our technical 
report [25] for the details. We first define a ground version of our calculus 
with the standardly inherited redundancy criterion and prove it complete. Devising 
suitable ground analogues of the rules ∀RW and ∃RW was difficult because the 
arguments of the Skolems depend on the variables occurring in the premise. 
Therefore, we parameterize the ground calculus by a function that provides 
ground Skolem terms in the ground versions of these rules. When lifting the 
completeness result to the nonground level, we instantiate the parameter with 
a specific function that allows us to lift the ∀RW and ∃RW inferences. 


To prove the ground calculus complete, we employ the framework for reduc- 
tion of counterexamples [3]. It requires us to construct an interpretation ℐ given 
a saturated unsatisfiable clause set that does not contain ⊥. Then we must show 
that any counterexample (i.e., a clause that does not hold in ℐ) can be reduced 
to a smaller (≺) counterexample by some inference. 

The interpretation ℐ is defined by a normalizing rewrite system as in the 
standard completeness proof of superposition. To ensure a correct interpretation 
of Booleans, we incrementally add Boolean rewrite rules along with the rules 
produced by clauses as usual. If a counterexample can be rewritten by a Boolean 
rule, we reduce it by a ×RW or ×HOIST inference. If it can be rewritten by a rule 
produced by a clause, we reduce it by a SUP inference. 

We derive the dynamic completeness of our nonground calculus using the 
saturation framework [35]. It gives us a nonground clause set N to work with. 
We then have to choose the parameters of our ground calculus such that all of 
its inferences from the grounding of N are redundant or liftable. We show that 
inferences rewriting below variables are redundant. Other inferences we show to 
be liftable—i.e., they are a ground instance of some inference from N. 


5 Inprocessing Clausification Methods 


Our calculus makes preprocessing clausification unnecessary: A problem specified 
by a formula f can be represented as a clause f ≈ ⊤. Our redundancy criterion 
allows us to add various sets of rules to steer the inprocessing clausification. 

Without any additional rules, our core calculus rules perform all the neces- 
sary reasoning about formulas. We call this method inner delayed clausification 
because the calculus rules tend to operate on the inner Boolean subterms first. 

The outer delayed clausification method adds the following rules to the cal- 
culus, which are guided by the outermost logical symbols. Let s and t be Boolean 
terms. Below, we let s⁺ range over literals of the form s ≈ ⊤ and s ≉ ⊥, and s⁻ 
over literals of the form s ≈ ⊥ and s ≉ ⊤. 
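
\[
\frac{s^{+} \vee C}{oc(s, C)}\;{+}\text{OUTERCLAUS}
\qquad
\frac{s^{-} \vee C}{oc(\neg s, C)}\;{-}\text{OUTERCLAUS}
\]
\[
\frac{s \approx t \vee C}{s \approx \bot \vee t \approx \top \vee C \qquad s \approx \top \vee t \approx \bot \vee C}\;{\approx}\text{OUTERCLAUS}
\]
\[
\frac{s \not\approx t \vee C}{s \approx \bot \vee t \approx \bot \vee C \qquad s \approx \top \vee t \approx \top \vee C}\;{\not\approx}\text{OUTERCLAUS}
\]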


The rules +OUTERCLAUS and −OUTERCLAUS are applicable to any term 
s whose root is a logical symbol, whereas the rules ≈OUTERCLAUS and 
≉OUTERCLAUS are only applicable if neither s nor t is ⊤ or ⊥. Clearly, 
our redundancy criterion allows us to replace the premise of all OUTER- 
CLAUS-rules with their conclusions. Nonetheless, the rules ≈OUTERCLAUS and 
≉OUTERCLAUS are not used as simplification rules since destructing equiva- 
lences disturbs the syntactic structure of the formulas, as noted by Ganzinger 
and Stuber [13]. The function oc(s, C) analyzes the shape of the formula s and 
distributes it over the clause C. For example, oc(s1 → s2, C) = {s1 ≈ ⊥ ∨ s2 ≈ 
⊤ ∨ C}, and oc(¬(s1 ∨ s2), C) = {s1 ≈ ⊥ ∨ C, s2 ≈ ⊥ ∨ C}. This function also 
replaces quantified terms by either a fresh free variable or a Skolem in the body 
of the quantified term, depending on the polarity. The full definition of oc(s, C) 
is specified in our technical report. 
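
The two examples of oc can be reproduced by the following partial sketch. It is not the definition from the technical report: it assumes that oc looks through negations and then decomposes only the outermost binary connective, it ignores quantifiers and (dis)equations, and it represents a predicate literal s ≈ ⊤ as (s, True) and s ≈ ⊥ as (s, False), with clauses as tuples of such literals.

```python
def oc(s, clause, positive=True):
    """One-level outer clausification of s (positive) or of its negation.

    Returns a list of clauses, each a tuple of (formula, polarity) literals.
    """
    if isinstance(s, tuple) and s[0] == "not":
        return oc(s[1], clause, not positive)
    if isinstance(s, tuple) and s[0] in ("and", "or", "imp"):
        head, a, b = s
        if head == "imp":
            parts, disjunctive = [(a, False), (b, True)], True   # a -> b is (not a) or b
        elif head == "or":
            parts, disjunctive = [(a, True), (b, True)], True
        else:
            parts, disjunctive = [(a, True), (b, True)], False
        if not positive:
            # De Morgan: flip every polarity and swap 'and' with 'or'.
            parts = [(t, not sign) for t, sign in parts]
            disjunctive = not disjunctive
        if disjunctive:
            return [tuple(parts) + clause]
        return [(part,) + clause for part in parts]
    return [((s, positive),) + clause]


C = (("r", True),)  # a placeholder remainder clause r = top
imp_result = oc(("imp", "s1", "s2"), C)
nor_result = oc(("not", ("or", "s1", "s2")), C)
```

Here imp_result is the single clause s1 ≈ ⊥ ∨ s2 ≈ ⊤ ∨ C, and nor_result consists of the two clauses s1 ≈ ⊥ ∨ C and s2 ≈ ⊥ ∨ C, as in the text.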

A third inprocessing clausification method is immediate clausification. It first 
preprocesses the input problem using a standard first-order clausification proce- 
dure such as Nonnengart and Weidenbach’s [24]. Then, during the proof search, 
when a clause C appears on which OUTERCLAUS rules could be applied, we 
apply the standard clausification procedure on the formula ∀x̄. C instead (where 
x̄ are the free variables of C), and replace C with the clausification results. 
With this method, the formulas are clausified in one step, making intermediate 
clausification results inaccessible to the simplification machinery. 


Renaming Common Formulas Following Tseitin [31], clausification proce- 
dures usually rename common formulas to prevent a possible combinatorial ex- 
plosion caused by naive clausification. In our two delayed clausification methods, 
we realize this idea using the following rule: 


\[
\frac{C_1[\sigma_1 f] \quad \cdots \quad C_n[\sigma_n f]}{C_1[\sigma_1 p(\bar{x})] \quad \cdots \quad C_n[\sigma_n p(\bar{x})] \quad R_1 \quad \cdots \quad R_m}\;\text{RENAME}
\]

Here, the formula f has a logical root, x̄ are the distinct free variables in f, 
p is a fresh symbol, σi is a substitution, and the clauses R1, ..., Rm are the 
result of simplifying a definition clause R = p(x̄) ≈ f as described below. The 
rule avoids exponential explosion by replacing n positions in which results of 
f’s clausification will appear into a single position in R. Optimizations such as 
polarity-aware renaming [24, Sect. 4] also apply to RENAME. 
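
The combinatorial explosion that renaming prevents can be made concrete with a standard textbook example, which is not taken from the paper: a disjunction of n conjunctions. The sketch below only counts clauses; the formula representation, the names p0, p1, ... for fresh symbols, and the use of one-directional (polarity-aware) definitions are assumptions for illustration.

```python
def naive_clause_count(formula):
    """Number of clauses obtained by distributing 'or' over 'and'."""
    if isinstance(formula, str):
        return 1
    head, left, right = formula
    l, r = naive_clause_count(left), naive_clause_count(right)
    return l + r if head == "and" else l * r  # head is "and" or "or"


def big_or(formulas):
    """Left-nested disjunction of a non-empty list of formulas."""
    result = formulas[0]
    for f in formulas[1:]:
        result = ("or", result, f)
    return result


def counts(n):
    """Clause counts for (a0 and b0) or ... or (a(n-1) and b(n-1))."""
    conjunctions = [("and", f"a{i}", f"b{i}") for i in range(n)]
    naive = naive_clause_count(big_or(conjunctions))
    # Rename each conjunction by a fresh atom p_i and add the one-directional
    # definition (not p_i) or (a_i and b_i), which yields two clauses.
    names = [f"p{i}" for i in range(n)]
    definitions = sum(
        naive_clause_count(("or", f"not_{p}", c))
        for p, c in zip(names, conjunctions)
    )
    renamed = naive_clause_count(big_or(names)) + definitions
    return naive, renamed
```

For n = 10, naive distribution yields 2^10 = 1024 clauses, whereas renaming yields 1 + 2 · 10 = 21.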

Several issues arise with RENAME as an inprocessing rule. We need to ensure 
that in R, f ≻ p(x̄), since otherwise demodulation might reintroduce a formula 
f in the simplified clauses. This can be achieved by giving the fresh symbol p 
a precedence smaller than that of all symbols initially present in the problem 
(other than ⊤ and ⊥). To ensure the precedence is well founded, the precedence 
of p must be greater than that of symbols previously introduced by the calculus. 
For KBO, we additionally set the weight of p to the minimal possible weight. 

For RENAME to be used as a simplification rule, we need to ensure that the 
conclusions are smaller than the premises. This is trivially true for all clauses 
other than the clause R. For example, let Ci = f ≈ ⊤ (σi is the identity). Clearly, 
R is larger than Ci. However, we can view the definition clause R as two clauses 
R⁺ = p(x̄) ≈ ⊥ ∨ f ≈ ⊤ and R⁻ = p(x̄) ≈ ⊤ ∨ f ≈ ⊥. Then, we can apply 
a single step of the OUTERCLAUS rules to R⁺ and R⁻ (on their subformula 
f), which further results in clauses R1, ..., Rm. Inspecting the OUTERCLAUS 
rules, it is clear that m ≤ 4, which makes enforcing this simplification tolerable. 
Furthermore, as f is simplified in each of R1, ..., Rm, they are smaller than any 
premise Ci. 

Another potential source of a combinatorial explosion in our calculus are 
formulas that occur deep in the arguments of uninterpreted predicates. Consider 
the clause C = p^i(x) ≈ ⊤ ∨ q^j(y) ≈ ⊤ where i, j > 2. If the first and the second 
literal are eligible in C, any clause p^{i1}(x) ≈ ⊤ ∨ p^{i2}(⊥) ≈ ⊤ ∨ ··· ∨ p^{ik}(⊥) ≈ 
⊤ ∨ q^{j1}(y) ≈ ⊤ ∨ q^{j2}(⊥) ≈ ⊤ ∨ ··· ∨ q^{jl}(⊥) ≈ ⊤ (where i1 + ··· + ik = i and 
j1 + ··· + jl = j), resulting from multiple BOOLHOIST applications, can be obtained 
in many different ways. This explosion can be avoided using the following rule: 

\[
\frac{s \approx t \vee C}{p(\bar{x}) \approx \top \vee C \qquad R_1 \quad \cdots \quad R_4}\;\text{RENAMEDEEP}
\]

where p is a fresh symbol, x̄ are all free variables occurring in s ≈ t, the clauses 
R1, ..., R4 result from simplifying R = p(x̄) ≈ (s ≈ t) as described above, and 
we impose the same precedence and weight restrictions on p as for RENAME. Fi- 
nally, we require that both s ≈ t and C contain deep Booleans where a Boolean 
subterm u|_p of a term u is a deep Boolean if there are at least two distinct proper 
prefixes q of the position p such that the root of u|_q is an uninterpreted predicate. 
Similarly to RENAME, the definition clause R can be larger than the premise. 
As OUTERCLAUS-rules might not apply to s ≈ t, we need a different solution: 


\[
\frac{C[u]}{C[\bot] \vee u \approx \top \qquad C[\top] \vee u \approx \bot}\;\text{BOOLHOISTSIMP}
\]

In this rule u is a non-variable Boolean subterm, different from ⊤ and ⊥, whose 
indicated occurrence is not in a literal u ≈ b where b is ⊤, ⊥ or a variable. 
Clearly, both conclusions of BOOLHOISTSIMP are smaller than the premise. As 
before, observing that R is equivalent to two clauses R⁺ = p(x̄) ≈ ⊥ ∨ s ≈ t and 
R⁻ = p(x̄) ≈ ⊤ ∨ s ≉ t, we simplify R⁺ and R⁻ into clauses that are guaranteed 
to be smaller than the premise. This is achieved by applying BOOLHOISTSIMP 
to one of the deep Boolean occurrences in both R⁺ and R⁻, which produces 
R1, ..., R4 and reduces the size of resulting clauses enough for them to be smaller 
than the premise of RENAMEDEEP. The RENAMEDEEP rule can be applied 
analogously to negative literals s ≉ t. 


6 Implementation 


Zipperposition [11] is an automatic theorem prover designed for easy prototyping 
of various extensions of superposition. So far, it has been extended to support 
induction, arithmetic, and various fragments of higher-order logic. We have im- 
plemented our calculus and its extensions described above in Zipperposition. 

Zipperposition has long supported λ as the only binder. Because introducing 
new binders would significantly complicate the implementation, we decided to 
represent the terms ∀x. t and ∃x. t as ∀(λx. t) and ∃(λx. t), respectively. 

We introduced a normalized presentation of predicate literals as either s ≈ ⊤ 
or s ≈ ⊥. As Zipperposition previously encoded them as s ≈ ⊤ or s ≉ ⊤, 
enforcing the new encoding was a source of tedious implementation effort. 

FACTOR inferences happen even when the maximal literal is selected since 
the discovery of condition (3) as described in Sect. 3 came after the evaluation. 

Zipperposition’s existing selection functions were not designed with Boolean 
subterm selection in mind. For instance, a function that selects a literal L with 
a selectable Boolean subterm s can make s eligible, even if the Boolean selection 
function did not select s. To mitigate this issue, we can optionally block selection 
of literals that contain selectable Boolean subterms. 

We implemented four Boolean selection functions: selecting the leftmost in- 
nermost, leftmost outermost, syntactically largest or syntactically smallest se- 
lectable subterm. Ties are broken by selecting the leftmost term. Additionally, 
we implemented a Boolean selection function that does not select any subterm. 
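
The five selection strategies just listed can be illustrated with a minimal sketch. It is not Zipperposition's code: the tuple-based term representation, the `BOOL_HEADS` set standing in for type information, and all function names are assumptions made for this example. Ties are broken by textual position, which approximates "leftmost".

```python
from typing import List, Optional, Tuple, Union

# A term is a constant/variable name or a tuple (head, arg1, ..., argn).
Term = Union[str, Tuple]

# Hypothetical set of Boolean-valued heads; a real prover would consult types.
BOOL_HEADS = {"and", "or", "not", "p", "q"}


def head(t: Term) -> str:
    return t if isinstance(t, str) else t[0]


def args(t: Term) -> Tuple:
    return () if isinstance(t, str) else t[1:]


def size(t: Term) -> int:
    """Syntactic size: number of symbol occurrences."""
    return 1 + sum(size(a) for a in args(t))


def selectable(t: Term, outermost_first: bool = True) -> List[Term]:
    """Selectable Boolean subterms from left to right.

    Pre-order lists outer terms before inner ones; post-order does the reverse.
    """
    inner = [s for a in args(t) for s in selectable(a, outermost_first)]
    here = [t] if head(t) in BOOL_HEADS else []
    return here + inner if outermost_first else inner + here


def leftmost_outermost(t: Term) -> Optional[Term]:
    cands = selectable(t, outermost_first=True)
    return cands[0] if cands else None


def leftmost_innermost(t: Term) -> Optional[Term]:
    # The first selectable term in post-order has no selectable proper subterm.
    cands = selectable(t, outermost_first=False)
    return cands[0] if cands else None


def largest(t: Term) -> Optional[Term]:
    cands = selectable(t)
    return max(cands, key=size) if cands else None  # max keeps the first maximum


def smallest(t: Term) -> Optional[Term]:
    cands = selectable(t)
    return min(cands, key=size) if cands else None  # min keeps the first minimum


def select_none(t: Term) -> Optional[Term]:
    return None


# f(and(p(a), q(b)), not(p(c))), where f is not Boolean-valued.
example = ("f", ("and", ("p", "a"), ("q", "b")), ("not", ("p", "c")))
```

On `example`, the outermost and largest strategies both pick the conjunction, while the innermost and smallest strategies both pick `p(a)`.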

Vukmirovié and Nummelin [33, Sect. 3.4] explored inprocessing clausification 
as part of their pragmatic approach to higher-order Boolean reasoning. They 
describe in detail how the formula renaming mechanism is implemented. We 
reuse their mechanism, and simplify definition clauses as described in Sect. 5. 


7 Evaluation 


The goal of our evaluation was to answer the following questions: 


1. How does our approach compare to preprocessing? 

2. How do the different inprocessing clausification methods compare? 

3. Is there an overhead of our calculus on problems without first-class Booleans? 
4. What effect do Boolean selection, LOCALRW, and BOOLHOISTSIMP have?


We filtered TPTP [29] and SMT-LIB [5] to get first-order benchmarks that
actually use the Boolean type. In TPTP THF we found 145 such problems
(TPTP Bool) and in the UF section of SMT-LIB 5507 such problems. Martin
Desharnais and Jasmin Blanchette generated 1253 Sledgehammer problems that
target our logic. To measure the overhead of our calculus, we randomly chose
1000 FOF and CNF problems from the TPTP (TPTP FO). Even with this
sample, the experiment could take up to
$(145 + 5507 + 1253 + 1000) \times \#\text{modes} \times 300\,\mathrm{s} \approx 9$ CPU months.
On StarExec servers, the evaluation took roughly three days
under low load. Evaluating on all 13 000 FOF and CNF problems instead could
have taken 2.5 times longer.
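
The worst-case estimate above can be reproduced with a few lines of arithmetic. The benchmark sizes and the 300 s limit are taken from the text; the number of modes (10) and the 30-day month are assumptions made only for this illustration, since the text leaves #modes symbolic.

```python
# Benchmark sizes and time limit as stated in the text.
BENCHMARK_SIZES = {
    "TPTP Bool": 145,
    "SMT-LIB": 5507,
    "Sledgehammer": 1253,
    "TPTP FO": 1000,
}
TIME_LIMIT_S = 300


def worst_case_cpu_seconds(num_modes: int, sizes=BENCHMARK_SIZES) -> int:
    """Upper bound: every problem runs into the time limit in every mode."""
    return sum(sizes.values()) * num_modes * TIME_LIMIT_S


def to_months(seconds: float, days_per_month: float = 30.0) -> float:
    return seconds / (86400 * days_per_month)


ASSUMED_MODES = 10  # assumption for illustration only
months = to_months(worst_case_cpu_seconds(ASSUMED_MODES))

# Using all 13 000 FOF and CNF problems instead of the 1000-problem sample:
full_sizes = dict(BENCHMARK_SIZES, **{"TPTP FO": 13000})
slowdown = sum(full_sizes.values()) / sum(BENCHMARK_SIZES.values())
```

Under these assumptions `months` comes out slightly above 9 and `slowdown` is about 2.5, in line with the figures quoted above.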

SMT-LIB interprets the symbol ite as the standard if-then-else function [5,
Sect. 3.7.1]. Whenever a term $s = \mathrm{ite}(t_1, t_2, t_3)$ of type $\tau$ occurs in a problem, we
replace $s$ with $f_\tau(t_1, t_2, t_3)$, where $f_\tau$ is a fresh symbol denoting the ite function
of a particular return type. To comply with SMT-LIB, we add the following
axioms: $\forall x\, y.\ f_\tau(\top, x, y) \approx x$ and $\forall x\, y.\ f_\tau(\bot, x, y) \approx y$. SMT-LIB allows the use
of let variable bindings [5, Sect. 3.6.1]. We simply replace each variable with
its definition in the body of the let bindings.
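
The two SMT-LIB translations described above can be sketched as follows. This is an illustration under stated assumptions, not the prover's implementation: terms are nested tuples, `type_of` is a caller-supplied (hypothetical) function giving the return type of an ite term, the fresh symbol is named `ite_<type>`, and variable capture under quantifiers is ignored.

```python
from typing import Callable, Dict, Optional, Set, Tuple, Union

Term = Union[str, Tuple]  # symbol/variable name, or (head, arg1, ..., argn)


def expand_lets(term: Term, env: Optional[Dict[str, Term]] = None) -> Term:
    """Replace each let-bound variable by its definition in the let body."""
    env = env or {}
    if isinstance(term, str):
        return env.get(term, term)
    if term[0] == "let":
        _, bindings, body = term
        # SMT-LIB let binds in parallel: definitions see only the outer scope.
        inner = dict(env)
        inner.update({var: expand_lets(defn, env) for var, defn in bindings})
        return expand_lets(body, inner)
    return (term[0], *(expand_lets(a, env) for a in term[1:]))


def eliminate_ite(term: Term, type_of: Callable[[Term], str],
                  axioms: Set[Tuple]) -> Term:
    """Replace ite(t1, t2, t3) by f_tau(t1, t2, t3) and record the two axioms."""
    if isinstance(term, str):
        return term
    new_args = tuple(eliminate_ite(a, type_of, axioms) for a in term[1:])
    if term[0] != "ite":
        return (term[0], *new_args)
    f_tau = f"ite_{type_of(term)}"  # one symbol per return type
    axioms.add(("forall", ("x", "y"), ("=", (f_tau, "true", "x", "y"), "x")))
    axioms.add(("forall", ("x", "y"), ("=", (f_tau, "false", "x", "y"), "y")))
    return (f_tau, *new_args)


# (let ((z (g a))) (ite (p z) z b)), with every ite assumed to have type Int.
problem = ("let", (("z", ("g", "a")),), ("ite", ("p", "z"), "z", "b"))
axioms: Set[Tuple] = set()
result = eliminate_ite(expand_lets(problem), lambda t: "Int", axioms)
```

Because the axioms are stored in a set, repeated occurrences of ite at the same type add them only once.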

Currently, among competing superposition-based provers only E and Vampire 
support first-order logic with interpreted Booleans, and they do so through pre- 
processing. We could not evaluate Vampire in the first-order mode with FOOL 
preprocessing because it yielded unsound results on TPTP Bool benchmarks. 
We were able to run E on all benchmarks, except for the ones in SMT syntax. 

We used Zipperposition’s first-order portfolio, which invokes the prover se- 
quentially with up to 13 configurations in different time slices. To compare 
different features, we ran different modes that enable a given feature in all of the 
portfolio configurations. All experiments were performed on the StarExec Iowa 
servers [28], equipped with Intel Xeon E5-2609 0 CPUs clocked at 2.40 GHz. We 
set the CPU time limit to 300s. Figure 1 displays the results. An empty cell 
indicates that a mode is not evaluated on that benchmark set. An archive with 
the raw evaluation data is publicly available.* 

A preprocessing transformation that removes all Boolean subterms occurring 
as arguments of symbols [34, Sect. 8], similar to Kotelnikov et al.’s FOOL clausi- 
fication approach [16], is implemented in Zipperposition. To answer question 1, 
we enabled preprocessing and compared it to our new calculus parameterized 
with the Boolean selection function that selects the smallest selectable subterm. 
The mode using our new calculus performs immediate inprocessing clausifica- 
tion, and we call it base, while the mode that preprocesses Boolean subterms is 
denoted by preprocess in Figure 1. 

The obtained results do not give a conclusive answer to question 1. On both 
TPTP Bool and Sledgehammer problems, some configuration of our new calculus 
manages to prove one problem more than preprocessing. On SMT-LIB bench- 
marks, the best configuration of our calculus matches preprocessing. This shows 
that our calculus already performs roughly as well as previously known tech- 
niques and suggests that it will be able to outperform preprocessing techniques 
after tuning of its parameters. 

For context, we provide the evaluation of E on supported benchmarks. On
TPTP FO benchmarks it solves 643 problems, on TPTP Bool benchmarks 144
problems, and on Sledgehammer benchmarks 674 problems. Note that there is
no straightforward way to compare these results with Zipperposition.

* https://doi.org/10.5281/zenodo.4550787

Mode          | TPTP FO | TPTP Bool | SMT-LIB | Sledgehammer
off           | 379     |           |         |
preprocess    |         | 142       | 1985    | 639
base          | 380     | 143       | 1984    | 638
base+outer    | 353     | 142       | 1978    | 637
base+inner    | 195     | 142       | 1699    | 556
base+sel_max  | 381     | 140       | 1982    | 638
base+sel_li   | 381     | 143       | 1983    | 638
base+sel_lo   | 381     | 140       | 1982    | 638
base+sel_∅    | 380     | 139       | 1985    | 640
base+BHS      | 381     | 142       | 1983    | 637
base+LOCALRW  | 380     | 143       | 1979    | 638

Fig. 1: Number of problems solved per benchmark set and Zipperposition mode.
The x-axes start from the number of problems solved by all evaluated modes.

Our base mode uses immediate inprocessing clausification. To answer ques- 
tion 2, we compared base with a variant of base with outer delayed clausification 
(base+outer) and with a variant with inner delayed clausification (base+inner). 
In the delayed modes, we invoke the RENAME rule on formulas that are discov- 
ered to occur more than four times in the proof state. 

The results show that inner delayed clausification, which performs the laziest 
form of clausification, gives the worst results on most benchmark sets. Outer 
delayed clausification performs roughly as well as immediate clausification on 
problems targeting our logic. On purely first-order problems, it performs slightly 
worse than immediate clausification. However, outer delayed clausification solves 
17 problems not solved by immediate clausification on these problems. This 
suggests that it opens new possibilities for first-order reasoning that need to be 
explored further with specialized strategies and additional rules. 

We found a problem with a conjecture of the form $s \rightarrow s$ that only the delayed
clausification modes can prove: the TPTP problem SWV122+1. The subformula
renaming mechanism of immediate clausification obfuscates this problem,
whereas delayed clausification allows BOOLSIMP to convert the negated conjecture
to $\bot$ directly, completing the proof in half a second.

To answer question 3, we compared the mode of Zipperposition in which all 
rules introduced by our calculus are disabled (off) with base on purely first-order 
problems. Our results show that both modes perform roughly the same. 

To answer question 4, we evaluated the Boolean selection functions we have
implemented: syntactically smallest selectable term (used in base), syntactically
largest selectable term ($\mathrm{sel}_{\max}$), leftmost innermost selectable term ($\mathrm{sel}_{\mathrm{li}}$),
leftmost outermost selectable term ($\mathrm{sel}_{\mathrm{lo}}$), and no Boolean selection ($\mathrm{sel}_{\emptyset}$). We also
evaluated two modes in which the rules LOCALRW and BOOLHOISTSIMP (BHS)
are enabled. None of the selection functions influences the performance greatly.
Similarly, we observe no substantial difference regardless of whether the rules
LOCALRW and BOOLHOISTSIMP are enabled.


8 Related Work and Conclusion 


The research presented in this paper extends superposition in two directions: 
with inprocessing clausification and with first-class Booleans. The first direc- 
tion has been explored before by Ganzinger and Stuber [13], and others have 
investigated it in the context of other superposition-related calculi [1,4,9, 20,21]. 

The other direction has been explored before by Kotelnikov et al., who devel- 
oped two approaches to cope with first-class Booleans [16,17]. For the quantified 
Boolean formula fragment of our logic, Seidl et al. developed a translation into 
effectively propositional logic [26]. More general approaches to incorporate theo- 
ries into superposition include superposition for finite domains [14], hierarchic 
superposition [6], and superposition with (co)datatypes [10]. 

For SMT solvers [22], supporting first-class Booleans is a widely accepted 
standard [5]. In contrast, the TPTP TFX format [30], intended to promote first- 
class Booleans in the rest of the automated reasoning community, has yet to gain 
traction. Software verification tools could clearly benefit from its popularization, 
as some of them identify terms and formulas in their logic, e.g., Why3 [12]. 

In conclusion, we devised a refutationally complete superposition calculus 
for first-order logic with interpreted Booleans. Its redundancy criterion allows 
us to flexibly add inprocessing clausification and other simplification rules. We 
believe our calculus is an excellent choice for the basis of new superposition 
provers: it offers the full power of standard superposition, while supporting rich 
input languages such as SMT-LIB and TPTP TFX. Even with unoptimized 
implementation and basic strategies, our calculus matches the performance of 
earlier approaches. In addition, the freedom it offers in term order, literal and 
Boolean subterm selection opens possibilities that are yet to be explored. Overall, 
our calculus appears as a solid foundation for richer logics in which the Boolean 
type cannot be efficiently preprocessed, such as higher-order logic [8]. In future 
work, we plan to tune the parameters and would find it interesting to combine 
our calculus with clause splitting techniques, such as AVATAR [32].

Abstract. We recently designed two calculi as stepping stones towards superposition
for full higher-order logic: Boolean-free λ-superposition and superposition
for first-order logic with interpreted Booleans. Stepping on these stones, we
finally reach a sound and refutationally complete calculus for higher-order logic
with polymorphism, extensionality, Hilbert choice, and Henkin semantics. In addition
to the complexity of combining the calculus's two predecessors, new challenges
arise from the interplay between λ-terms and Booleans. Our implementation
in Zipperposition outperforms all other higher-order theorem provers and is
on a par with an earlier, pragmatic prototype of Booleans in Zipperposition.


1 Introduction 


Superposition is a leading calculus for first-order logic with equality. We have been 
wondering for some years whether it would be possible to gracefully generalize it to 
extensional higher-order logic and use it as the basis of a strong higher-order auto- 
matic theorem prover. Towards this goal, we have, together with colleagues, designed 
superposition-like calculi for three intermediate logics between first-order and higher- 
order logic. Now we are finally ready to assemble a superposition calculus for full 
higher-order logic. The filiation of our new calculus from Bachmair and Ganzinger’s 
standard first-order superposition is as follows: 


                        Standard superposition
                   Bachmair and Ganzinger [2] (Sup)
                      /                        \
Superposition with ↔ and delayed CNF      Boolean-free λ-free superposition
Ganzinger and Stuber [16] (↔Sup)          Bentkamp et al. [7] (λfSup)
                |                                      |
Superposition with Booleans               Boolean-free λ-superposition
Nummelin et al. [23] (oSup)               Bentkamp et al. [6] (λSup)
                      \                        /
                        Boolean λ-superposition
                         This paper (oλSup)

Our goal was to devise an efficient calculus for higher-order logic. To achieve it, we
pursued two objectives. First, the calculus should be refutationally complete. Second,
the calculus should coincide as much as possible with its predecessors oSup and λSup
on the respective fragments of higher-order logic (which in turn essentially coincide
with Sup on first-order logic). Achieving these objectives is the main contribution of
this paper. We made an effort to keep the calculus simple, but often the refutational
completeness proof forced our hand to add conditions or special cases.

Like oSup, our calculus oλSup operates on clauses that can contain Boolean subterms,
and it interleaves clausification with other inferences. Like λSup, oλSup eagerly
βη-normalizes terms, employs full higher-order unification, and relies on a fluid subterm
superposition rule (FLUIDSUP) to simulate superposition inferences below applied
variables, i.e., terms of the form $y\, t_1 \ldots t_n$ for $n \geq 1$.

Because oSup contains several superposition-like inference rules for Boolean sub- 
terms, our completeness proof requires dedicated fluid Boolean subterm hoisting rules 
(FLUIDBOOLHOIST, FLUIDLOOBHOIST), which simulate Boolean inferences below 
applied variables, in addition to FLUIDSUP, which simulates superposition inferences. 

Due to restrictions related to the term order that parameterizes superposition, it is
difficult to handle variables bound by unclausified quantifiers if these variables occur
applied or in arguments of applied variables. We solve the issue by replacing such quantified
terms $\forall y.\ t$ by equivalent terms $(\lambda y.\ t) \approx (\lambda y.\ \top)$ in a preprocessing step.

We implemented our calculus in the Zipperposition prover and evaluated it on TPTP 
and Sledgehammer benchmarks. The new Zipperposition outperforms all other higher- 
order provers and is on a par with an ad hoc implementation of Booleans in the same 
prover by Vukmirović and Nummelin [30]. We refer to the technical report [8] for the 
completeness proof and a more detailed account of the calculus and its evaluation. 


2 Logic 


Our logic is higher-order logic (simple type theory) with rank-1 polymorphism, Hilbert
choice, and functional and Boolean extensionality. Its syntax mostly follows Gordon
and Melham [17]. We use the notation $\bar{a}_n$ or $\bar{a}$ to stand for the tuple $(a_1, \ldots, a_n)$ where
$n \geq 0$. Deviating from Gordon and Melham, type arguments are explicit, written as
$c\langle\bar{\tau}_m\rangle$ for a symbol $c : \Pi\bar{\alpha}_m.\ \upsilon$ and types $\bar{\tau}_m$. In the type signature $\Sigma_{\mathrm{ty}}$, we require the
presence of a nullary Boolean type constructor $o$ and a binary function type constructor
$\rightarrow$. In the term signature $\Sigma$, we require the presence of the logical symbols
$\boldsymbol{\top}$, $\boldsymbol{\bot}$, $\boldsymbol{\neg}$, $\boldsymbol{\land}$, $\boldsymbol{\lor}$, $\boldsymbol{\rightarrow}$, $\boldsymbol{\forall}$, $\boldsymbol{\exists}$, $\boldsymbol{\approx}$, and $\boldsymbol{\not\approx}$.
The logical symbols are shown in bold to distinguish them from
the notation used for clauses below. Moreover, we require the presence of the Hilbert
choice operator $\varepsilon \in \Sigma$. Although $\varepsilon$ is interpreted in our semantics, we do not consider
it a logical symbol. Our calculus will enforce the semantics of $\varepsilon$ by an axiom, whereas
the semantics of the logical symbols will be enforced by inference rules. We write $\mathcal{V}$ for
the set of (term) variables. We use Henkin semantics, in the style of Fitting [15], with
respect to which we can prove our calculus refutationally complete. In summary, our
logic essentially coincides with the TPTP TH1 format [20].

We generally view terms modulo αβη-equivalence. When defining operations that
need to analyze the structure of terms, however, we use a custom normal form as the
default representative of a βη-equivalence class: The $\beta\eta Q_\eta$-normal form $t{\downarrow}_{\beta\eta Q_\eta}$ of a
term $t$ is obtained by bringing the term into η-short β-normal form and finally applying
the rewrite rule $Q\langle\tau\rangle\, s \rightarrow_{Q_\eta} Q\langle\tau\rangle\, (\lambda x.\ s\, x)$ exhaustively whenever $s$ is not a
λ-expression. Here and elsewhere, $Q$ stands for either $\forall$ or $\exists$.

On top of the standard higher-order terms, we install a clausal structure that allows
us to formulate calculus rules in the style of first-order superposition. A literal $s \mathrel{\dot\approx} t$ is
an equation $s \approx t$ or disequation $s \not\approx t$ of terms $s$ and $t$; both equations and disequations
are unordered pairs. A clause $L_1 \lor \cdots \lor L_n$ is a finite multiset of literals $L_j$. The empty
clause is written as $\bot$. This clausal structure does not restrict the logic, because an
arbitrary term $t$ of Boolean type can be written as the clause $t \approx \top$.

We considered excluding negative literals by encoding them as $(s \approx t) \approx \bot$, following
↔Sup [16]. However, this approach would make the conclusion of the equality
factoring rule (EFACT) too large for our purposes. Regardless, the simplification machinery
will allow us to reduce negative literals $t \not\approx \bot$ and $t \not\approx \top$ to $t \approx \top$ and $t \approx \bot$,
respectively, thereby eliminating redundant representations of nonequational literals.

We let $\mathrm{CSU}(s, t)$ denote an arbitrary (preferably, minimal) complete set of unifiers
for two terms $s$ and $t$ on the set of free variables of the clauses in which $s$ and $t$ occur.
To compute such sets, Huet-style preunification [18] is not sufficient, and we must resort
to a full unification procedure [19,29]. To cope with the nontermination of such
procedures, we use dovetailing as described by Vukmirović et al. [28, Sect. 5].

Some of the rules in our calculus introduce Skolem symbols, representing objects
mandated by existential quantification. We assume that these symbols do not occur in
the input problem. More formally, given a problem over a term signature $\Sigma$, our calculus
operates on a Skolem-extended term signature $\Sigma_{\mathrm{sk}}$ that, in addition to all symbols from
$\Sigma$, inductively contains symbols $\mathrm{sk}_{\Pi\bar{\alpha}.\ \forall\bar{x}.\ \exists z.\ t\, z} : \Pi\bar{\alpha}.\ \bar{\tau} \rightarrow \upsilon$ for all types $\upsilon$, variables $z : \upsilon$,
and terms $t : \upsilon \rightarrow o$ over $\Sigma_{\mathrm{sk}}$, where $\bar{\alpha}$ are the free type variables occurring in $t$ and $\bar{x} : \bar{\tau}$
are the free term variables occurring in $t$, both in order of first occurrence.


3 The Calculus 


The oλSup calculus closely resembles λSup, augmented with rules for Boolean reasoning
that are inspired by oSup. As in λSup, superposition-like inferences are restricted
to certain first-order-like subterms, the green subterms, which we define inductively as
follows: Every term $t$ is a green subterm of $t$, and for all symbols $f \in \Sigma \setminus \{\forall, \exists\}$, if $t$ is
a green subterm of $u_i$ for some $i$, then $t$ is a green subterm of $f\langle\bar{\tau}\rangle\, \bar{u}$. For example, the
green subterms of $f\, (g\, (\neg p))\, (\forall\langle\tau\rangle\, (\lambda x.\ q))\, (y\, a)\, (\lambda x.\ h\, b)$ are the term itself, $g\, (\neg p)$, $\neg p$,
$p$, $\forall\langle\tau\rangle\, (\lambda x.\ q)$, $y\, a$, and $\lambda x.\ h\, b$. We write $s\langle t\rangle$ to denote a term $s$ with a green subterm $t$
and call the first-order-like context $s\langle\ \rangle$ a green context.
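
The inductive definition of green subterms translates directly into a short recursive function. The following sketch uses an assumed term representation (the classes `Sym`, `Var`, and `Lam`, with type arguments omitted) and reproduces the example from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Sym:
    """Application of a symbol f to term arguments (type arguments omitted)."""
    name: str
    args: Tuple["Term", ...] = ()


@dataclass(frozen=True)
class Var:
    """A variable, possibly applied to arguments."""
    name: str
    args: Tuple["Term", ...] = ()


@dataclass(frozen=True)
class Lam:
    """A lambda-expression."""
    var: str
    body: "Term"


Term = Union[Sym, Var, Lam]
QUANTIFIERS = {"forall", "exists"}


def green_subterms(t: Term) -> List[Term]:
    """Every term is a green subterm of itself; recursion continues only
    through arguments of symbols other than the quantifiers."""
    result = [t]
    if isinstance(t, Sym) and t.name not in QUANTIFIERS:
        for u in t.args:
            result.extend(green_subterms(u))
    return result


# f (g (not p)) (forall (lam x. q)) (y a) (lam x. h b)
p, q, a, b = Sym("p"), Sym("q"), Sym("a"), Sym("b")
not_p = Sym("not", (p,))
g_not_p = Sym("g", (not_p,))
forall_q = Sym("forall", (Lam("x", q),))
y_a = Var("y", (a,))
lam_hb = Lam("x", Sym("h", (b,)))
term = Sym("f", (g_not_p, forall_q, y_a, lam_hb))

greens = green_subterms(term)
```

The recursion stops at applied variables, λ-expressions, and quantifiers, so `a`, `q`, `h b`, and `b` are not green subterms of `term`.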

Following λSup, we call a term $t$ fluid if (1) $t{\downarrow}_{\beta\eta Q_\eta}$ is of the form $y\, \bar{u}_n$ where $n \geq 1$,
or (2) $t{\downarrow}_{\beta\eta Q_\eta}$ is a λ-expression and there exists a substitution $\sigma$ such that $t\sigma{\downarrow}_{\beta\eta Q_\eta}$ is
not a λ-expression (due to η-reduction). Intuitively, fluid terms are terms whose normal
form can change radically as a result of instantiation.

We define deeply occurring variables as in λSup, but exclude λ-expressions directly
below quantifiers: A variable occurs deeply in a clause $C$ if it occurs inside an argument
of an applied variable or inside a λ-expression that is not directly below a quantifier.

Preprocessing. Our completeness theorem requires that quantified variables do not
appear in certain higher-order contexts. We use preprocessing to eliminate problematic
occurrences of quantifiers. The rewrite rules $\forall_\approx$ and $\exists_\approx$, which we collectively denote
by $Q_\approx$, are defined as $\forall\langle\tau\rangle \rightarrow_{\forall_\approx} \lambda y.\ y \approx (\lambda x.\ \top)$ and $\exists\langle\tau\rangle \rightarrow_{\exists_\approx} \lambda y.\ y \not\approx (\lambda x.\ \bot)$ where
the rewritten occurrence of $Q\langle\tau\rangle$ is unapplied or has an argument of the form $\lambda x.\ v$ such
that $x$ occurs as a nongreen subterm of $v$. If either of these rewrite rules can be applied
to a given term, the term is $Q_\approx$-reducible; otherwise, it is $Q_\approx$-normal.

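For example, the term $\lambda y.\ \exists\langle\iota \rightarrow \iota\rangle\, (\lambda x.\ g\, x\, y\, (z\, y)\, (f\, x))$ is $Q_\approx$-normal. A term may be
$Q_\approx$-reducible because a quantifier appears unapplied (e.g., $g\ \exists\langle\iota\rangle$); a quantified variable
occurs applied (e.g., $\exists\langle\iota \rightarrow \iota\rangle\, (\lambda x.\ x\, a)$); a quantified variable occurs inside a nested
λ-expression (e.g., $\forall\langle\iota\rangle\, (\lambda x.\ f\, (\lambda y.\ x))$); or a quantified variable occurs in the argument
of a variable, either a free variable (e.g., $\forall\langle\iota\rangle\, (\lambda x.\ z\, x)$) or a variable bound above the
quantifier (e.g., $\lambda y.\ \exists\langle\iota\rangle\, (\lambda x.\ y\, x)$).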

A preprocessor $Q_\approx$-normalizes the input problem. Although inferences may produce
$Q_\approx$-reducible clauses, we do not $Q_\approx$-normalize during the derivation process itself.
Instead, $Q_\approx$-reducible ground instances of clauses will be considered redundant by
the redundancy criterion. Thus, clauses whose ground instances are all $Q_\approx$-reducible
can be deleted. However, there are $Q_\approx$-reducible clauses, such as $x\ \forall\langle\iota\rangle \approx a$, that nevertheless
have $Q_\approx$-normal ground instances. Such clauses must be kept because the
completeness proof relies on their $Q_\approx$-normal ground instances.

In principle, we could omit the side condition of the $Q_\approx$-rewrite rules and eliminate
all quantifiers. However, the calculus (especially, the redundancy criterion) performs
better with quantifiers than with λ-expressions, which is why we restrict $Q_\approx$-normalization
as much as the completeness proof allows. Extending the preprocessing to eliminate
all Boolean terms as in Kotelnikov et al. [21] does not work for higher-order logic
because Boolean terms can contain variables bound by enclosing λ-expressions.


Term Order. The calculus is parameterized by a well-founded strict total order $\succ$ on
ground terms satisfying these four criteria: (O1) compatibility with green contexts,
i.e., $s' \succ s$ implies $t\langle s'\rangle \succ t\langle s\rangle$; (O2) green subterm property, i.e., $t\langle s\rangle \succeq s$ where $\succeq$ is
the reflexive closure of $\succ$; (O3) $u \succ \bot \succ \top$ for all terms $u \notin \{\top, \bot\}$; (O4) $Q\langle\tau\rangle\, t \succ t\, u$
for all types $\tau$, terms $t$, and terms $u$ such that $Q\langle\tau\rangle\, t$ and $u$ are $Q_\approx$-normal and the only
Boolean green subterms of $u$ are $\top$ and $\bot$. The restriction of (O4) to $Q_\approx$-normal terms
ensures that term orders fulfilling the requirements exist, but it forces us to preprocess
the input problem. We extend $\succ$ to literals and clauses via the multiset extensions in the
standard way [2, Sect. 2.4].

For nonground terms, $\succ$ is required to be a strict partial order such that $t \succ s$ implies
$t\theta \succ s\theta$ for all grounding substitutions $\theta$. As in λSup, we also introduce a nonstrict
variant $\succsim$ for which we require that $t\theta \succeq s\theta$ for all grounding substitutions $\theta$ whenever
$t \succsim s$, and similarly for literals and clauses.

To construct a concrete order fulfilling these requirements, we define an encoding
into untyped first-order terms, and compare these using a variant of the Knuth-Bendix
order. In a first step, denoted $\mathcal{O}$, the encoding translates fluid terms $t$ as fresh variables $z_t$;
nonfluid λ-expressions $\lambda x : \tau.\ u$ as $\mathrm{lam}(\mathcal{O}(\tau), \mathcal{O}(u))$; applied quantifiers $Q\langle\tau\rangle\, (\lambda x : \tau.\ u)$ as
$Q_1(\mathcal{O}(\tau), \mathcal{O}(u))$; and other terms $f\langle\bar{\tau}\rangle\, \bar{u}$ as $f(\mathcal{O}(\bar{\tau}), \mathcal{O}(\bar{u}))$. Bound variables are encoded
as constants $\mathrm{db}^i$ corresponding to De Bruijn indices. In a second step, denoted $\mathcal{P}$, the
encoding replaces $Q_1$ by $Q_1'$ and variables $z$ by $z'$ whenever they occur below lam. For
example, $\forall\langle\iota\rangle\, (\lambda x.\ p\, y\, y\, (\lambda u.\ f\, y\, y\, (\forall\langle\iota\rangle\, (\lambda v.\ u))))$ is encoded as
$\forall_1(\iota, p_3(y, y, \mathrm{lam}(o, f_3(y', y', \forall_1'(\iota, \mathrm{db}^1)))))$. The first-order terms can then be compared using a transfinite Knuth-Bendix
order $\succ_{\mathrm{kb}}$ [22]. Let the weight of $\forall_1$ and $\exists_1$ be $\omega$, the weight of $\top_0$ and $\bot_0$
be 1, and the weights of all other symbols be less than $\omega$. Let the precedence $>$ be
total and $\bot_0$, $\top_0$ be the symbols of lowest precedence, with $\bot_0 > \top_0$. Then let $t \succ s$ if
$\mathcal{O}(\mathcal{P}(t)) \succ_{\mathrm{kb}} \mathcal{O}(\mathcal{P}(s))$ and $t \succsim s$ if $\mathcal{O}(\mathcal{P}(t)) \succeq_{\mathrm{kb}} \mathcal{O}(\mathcal{P}(s))$.


Selection Functions. The calculus is also parameterized by a literal selection function
and a Boolean subterm selection function. We define an element $x$ of a multiset $M$ to be
$\unrhd$-maximal for some relation $\unrhd$ if for all $y \in M$ with $y \unrhd x$, we have $y = x$. It is strictly
$\unrhd$-maximal if it is $\unrhd$-maximal and occurs only once in $M$.

The literal selection function HLitSel maps each clause to a subset of selected literals.
A literal may not be selected if it is positive and neither side is $\bot$. Moreover, a
literal $L\langle y\rangle$ may not be selected if $y\, \bar{u}_n$, with $n \geq 1$, is a $\succsim$-maximal term of the clause.

The Boolean subterm selection function HBoolSel maps each clause $C$ to a subset
of selected subterms in $C$. Selected subterms must be green subterms of Boolean type.
Moreover, a subterm $s$ must not be selected if $s = \top$, if $s = \bot$, if $s$ is a variable-headed
term, if $s$ is at the topmost position on either side of a positive literal, or if $s$ contains a
variable $y$ as a green subterm, and $y\, \bar{u}_n$, with $n \geq 1$, is a $\succsim$-maximal term of the clause.


Eligibility. A literal $L$ is (strictly) eligible w.r.t. a substitution $\sigma$ in $C$ if it is selected
in $C$ or there are no selected literals and no selected Boolean subterms in $C$ and $L\sigma$ is
(strictly) $\succsim$-maximal in $C\sigma$.

The eligible subterms of a clause $C$ w.r.t. a substitution $\sigma$ are inductively defined as
follows: Any selected subterm is eligible. If a literal $L = s \mathrel{\dot\approx} t$ with $s\sigma \not\precsim t\sigma$ is either
eligible and negative or strictly eligible and positive, then the subterm $s$ is eligible. If a
subterm $t$ is eligible and the head of $t$ is not $\approx$ or $\not\approx$, all direct green subterms of $t$ are
eligible. If a subterm $t$ is eligible and $t$ is of the form $u \approx v$ or $u \not\approx v$, then $u$ is eligible if
$u\sigma \not\precsim v\sigma$ and $v$ is eligible if $u\sigma \not\succsim v\sigma$.


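The Core Inference Rules. The calculus consists of the following core inference rules.
The first five rules stem from λSup, with minor adaptions concerning Booleans:

\[
\frac{\overbrace{D' \lor t \approx t'}^{D} \quad \overbrace{C\langle u\rangle}^{C}}{(D' \lor C\langle t'\rangle)\sigma}\ \text{SUP}
\qquad
\frac{\overbrace{C' \lor u \not\approx u'}^{C}}{C'\sigma}\ \text{ERES}
\qquad
\frac{\overbrace{C' \lor u' \approx v' \lor u \approx v}^{C}}{(C' \lor v \not\approx v' \lor u \approx v')\sigma}\ \text{EFACT}
\]
\[
\frac{\overbrace{D' \lor t \approx t'}^{D} \quad \overbrace{C\langle u\rangle}^{C}}{(D' \lor C\langle z\, t'\rangle)\sigma}\ \text{FLUIDSUP}
\qquad
\frac{\overbrace{C' \lor s \approx s'}^{C}}{C'\sigma \lor s\sigma\, \bar{x}_n \approx s'\sigma\, \bar{x}_n}\ \text{ARGCONG}
\]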


SUP 1. $u$ is not fluid; 2. $u$ is not a variable deeply occurring in $C$; 3. if $u$ is a variable $y$,
there must exist a grounding substitution $\theta$ such that $t\sigma\theta \succ t'\sigma\theta$ and $C\sigma\theta \prec C''\sigma\theta$,
where $C'' = C\{y \mapsto t'\}$; 4. $\sigma \in \mathrm{CSU}(t, u)$; 5. $t\sigma \not\precsim t'\sigma$; 6. $u$ is eligible in $C$ w.r.t. $\sigma$;
7. $C\sigma \not\precsim D\sigma$; 8. $t \approx t'$ is strictly eligible in $D$ w.r.t. $\sigma$; 9. $t\sigma$ is not a fully applied
logical symbol; 10. if $t'\sigma = \bot$, the subterm $u$ is at the top level of a positive literal.

ERES 1. $\sigma \in \mathrm{CSU}(u, u')$; 2. $u \not\approx u'$ is eligible in $C$ w.r.t. $\sigma$.

EFacT 1.0 € CSU(u,u’'); 2. uo Avo; 3. (u ~ v)o is X -maximal in Co; 4. uo Z vo; 
5. nothing is selected in C. 

FLUIDSUP 1. $u$ is a variable deeply occurring in $C$ or $u$ is fluid; 2. $z$ is a fresh variable;
3. $\sigma \in \mathrm{CSU}(z\, t, u)$; 4. $(z\, t')\sigma \neq (z\, t)\sigma$; 5-10. as for SUP.

ARGCONG 1. $n > 0$; 2. $\sigma$ is the most general type substitution that ensures well-typedness
of the conclusion for a given $n$; 3. $\bar{x}_n$ is a tuple of distinct fresh variables;
4. the literal $s \approx s'$ is strictly eligible in $C$ w.r.t. $\sigma$.


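The following rules are concerned with Boolean reasoning and originate from oSup.
They have been adapted to support polymorphism and applied variables.

\[
\frac{C\langle u\rangle}{(C\langle\bot\rangle \lor u \approx \top)\sigma}\ \text{BOOLHOIST}
\qquad
\frac{C' \lor s \approx s'}{C'\sigma}\ \text{FALSEELIM}
\]
\[
\frac{C\langle u\rangle}{(C\langle\bot\rangle \lor x \approx y)\sigma}\ \text{EQHOIST}
\qquad
\frac{C\langle u\rangle}{(C\langle\top\rangle \lor x \approx y)\sigma}\ \text{NEQHOIST}
\qquad
\frac{C\langle u\rangle}{C\langle t'\rangle\sigma}\ \text{BOOLRW}
\]
\[
\frac{C\langle u\rangle}{(C\langle\bot\rangle \lor y\, x \approx \top)\sigma}\ \text{FORALLHOIST}
\qquad
\frac{C\langle u\rangle}{C\langle y\, (\mathrm{sk}_{\Pi\bar{\alpha}.\ \forall\bar{x}.\ \exists z.\ \neg\, y\sigma\, z}\langle\bar{\alpha}\rangle\, \bar{x})\rangle\sigma}\ \text{FORALLRW}
\]
\[
\frac{C\langle u\rangle}{(C\langle\top\rangle \lor y\, x \approx \bot)\sigma}\ \text{EXISTSHOIST}
\qquad
\frac{C\langle u\rangle}{C\langle y\, (\mathrm{sk}_{\Pi\bar{\alpha}.\ \forall\bar{x}.\ \exists z.\ y\sigma\, z}\langle\bar{\alpha}\rangle\, \bar{x})\rangle\sigma}\ \text{EXISTSRW}
\]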


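BOOLHOIST 1. $\sigma$ is a type unifier of the type of $u$ with the Boolean type $o$ (i.e., the
identity if $u$ is Boolean or $\{\alpha \mapsto o\}$ if $u$ is of type $\alpha$ for some type variable $\alpha$);
2. the head of $u$ is neither a variable nor a logical symbol; 3. $u$ is eligible in $C$;
4. the occurrence of $u$ is not at the top level of a positive literal.

EQHOIST, NEQHOIST, FORALLHOIST, EXISTSHOIST 1. $\sigma \in \mathrm{CSU}(u, x \approx y)$,
$\sigma \in \mathrm{CSU}(u, x \not\approx y)$, $\sigma \in \mathrm{CSU}(u, \forall\langle\alpha\rangle\, y)$, or $\sigma \in \mathrm{CSU}(u, \exists\langle\alpha\rangle\, y)$, respectively; 2. $x$,
$y$, and $\alpha$ are fresh variables; 3. $u$ is eligible in $C$ w.r.t. $\sigma$; 4. if the head of $u$ is
a variable, it must be applied and the affected literal must be of the form $u \approx \top$,
$u \approx \bot$, or $u \approx v$ where $v$ is a variable-headed term.

FALSEELIM 1. $\sigma \in \mathrm{CSU}(s \approx s', \bot \approx \top)$; 2. $s \approx s'$ is strictly eligible in $C$ w.r.t. $\sigma$.

BOOLRW 1. $\sigma \in \mathrm{CSU}(t, u)$ and $(t, t')$ is one of the following pairs, where $y$ is a fresh
variable: $(\neg\bot, \top)$, $(\neg\top, \bot)$, $(\bot \land \bot, \bot)$, $(\top \land \bot, \bot)$, $(\bot \land \top, \bot)$, $(\top \land \top, \top)$,
$(\bot \lor \bot, \bot)$, $(\top \lor \bot, \top)$, $(\bot \lor \top, \top)$, $(\top \lor \top, \top)$, $(\bot \rightarrow \bot, \top)$, $(\top \rightarrow \bot, \bot)$,
$(\bot \rightarrow \top, \top)$, $(\top \rightarrow \top, \top)$, $(y \approx y, \top)$, $(y \not\approx y, \bot)$; 2. $u$ is not a variable; 3. $u$ is eligible in
$C$ w.r.t. $\sigma$; 4. if the head of $u$ is a variable, it must be applied and the affected literal
must be of the form $u \approx \top$, $u \approx \bot$, or $u \approx v$ where $v$ is a variable-headed term.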
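
To make the BOOLRW pairs concrete, the following sketch applies them as a rewrite table to ground terms. It is a deliberate simplification: the actual rule works modulo higher-order unification and is subject to the eligibility conditions above, whereas this code only matches syntactically. The tuple representation and the symbol names (`"true"`, `"false"`, `"imp"`, and so on) are assumptions for the illustration.

```python
from typing import Optional, Tuple, Union

Term = Union[str, Tuple]  # constant name, or (head, arg1, ..., argn)
TOP, BOT = "true", "false"

# The fourteen ground pairs (t, t') from condition 1 of BOOLRW.
TABLE = {
    ("not", BOT): TOP, ("not", TOP): BOT,
    ("and", BOT, BOT): BOT, ("and", TOP, BOT): BOT,
    ("and", BOT, TOP): BOT, ("and", TOP, TOP): TOP,
    ("or", BOT, BOT): BOT, ("or", TOP, BOT): TOP,
    ("or", BOT, TOP): TOP, ("or", TOP, TOP): TOP,
    ("imp", BOT, BOT): TOP, ("imp", TOP, BOT): BOT,
    ("imp", BOT, TOP): TOP, ("imp", TOP, TOP): TOP,
}


def bool_rw_step(t: Term) -> Optional[Term]:
    """Rewrite t at the root if it matches one of the pairs, else return None."""
    if t in TABLE:
        return TABLE[t]
    # The pairs (y = y, true) and (y != y, false), restricted to syntactic identity.
    if isinstance(t, tuple) and len(t) == 3 and t[0] in ("eq", "neq") and t[1] == t[2]:
        return TOP if t[0] == "eq" else BOT
    return None


def normalize(t: Term) -> Term:
    """Apply the pairs bottom-up until no rewrite applies at any position."""
    if isinstance(t, tuple):
        t = (t[0], *(normalize(a) for a in t[1:]))
    rewritten = bool_rw_step(t)
    return normalize(rewritten) if rewritten is not None else t


example = ("imp", ("and", TOP, ("not", BOT)), ("eq", "a", "a"))
```

Here `normalize(example)` reduces the conjunction and the equation to `true` and then the implication to `true`, whereas an equation between distinct constants such as `("eq", "a", "b")` is left unchanged.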


FORALLRW, EXISTSRW 1. $\sigma \in \mathrm{CSU}(\forall\langle\beta\rangle\, y, u)$ and $\sigma \in \mathrm{CSU}(\exists\langle\beta\rangle\, y, u)$, respectively,
where $\beta$ is a fresh type variable, $y$ is a fresh term variable, $\bar{\alpha}$ are the free type variables
and $\bar{x}$ are the free term variables occurring in $y\sigma$ in order of first occurrence;
2. $u$ is not a variable; 3. $u$ is eligible in $C$ w.r.t. $\sigma$; 4. if the head of $u$ is a variable, it
must be applied and the affected literal must be of the form $u \approx \top$, $u \approx \bot$, or $u \approx v$
where $v$ is a variable-headed term; 5. for FORALLRW, the indicated occurrence of
$u$ is not in a literal $u \approx \top$, and for EXISTSRW, the indicated occurrence of $u$ is not
in a literal $u \approx \bot$.


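Like SUP, the Boolean rules must also be simulated in fluid terms. The following
rules are Boolean counterparts of FLUIDSUP:

\[
\frac{C\langle u\rangle}{(C\langle z\, \bot\rangle \lor x \approx \top)\sigma}\ \text{FLUIDBOOLHOIST}
\qquad
\frac{C\langle u\rangle}{(C\langle z\, \top\rangle \lor x \approx \bot)\sigma}\ \text{FLUIDLOOBHOIST}
\]

FLUIDBOOLHOIST 1. $u$ is fluid; 2. $z$ and $x$ are fresh variables; 3. $\sigma \in \mathrm{CSU}(z\, x, u)$;
4. $(z\, \bot)\sigma \neq (z\, x)\sigma$; 5. $x\sigma \neq \top$ and $x\sigma \neq \bot$; 6. $u$ is eligible in $C$ w.r.t. $\sigma$.
FLUIDLOOBHOIST Like the above but with $\bot$ replaced by $\top$ in condition 4.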


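In addition to the inference rules, our calculus relies on two axioms, below. Axiom
(EXT), from λSup, embodies functional extensionality; the expression $\mathrm{diff}\langle\alpha, \beta\rangle$
abbreviates $\mathrm{sk}_{\Pi\alpha, \beta.\ \forall z\, y.\ \exists x.\ z\, x \not\approx y\, x}\langle\alpha, \beta\rangle$. Axiom (CHOICE) characterizes the Hilbert choice
operator $\varepsilon$.

\[
\begin{array}{lr}
z\, (\mathrm{diff}\langle\alpha, \beta\rangle\, z\, y) \not\approx y\, (\mathrm{diff}\langle\alpha, \beta\rangle\, z\, y) \lor z \approx y & \text{(EXT)} \\
y\, x \approx \bot \lor y\, (\varepsilon\langle\alpha\rangle\, y) \approx \top & \text{(CHOICE)}
\end{array}
\]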


Rationale for the Rules. Most of the calculus's rules are adapted from its precursors.
SUP, ERES, and EFACT are already present in Sup, with slightly different side conditions.
Notably, as in λfSup and λSup, SUP inferences are required only into green
contexts. Other subterms are accessed indirectly via ARGCONG and (EXT).

The rules BOOLHOIST, EQHOIST, NEQHOIST, FORALLHOIST, EXISTSHOIST,
FALSEELIM, BOOLRW, FORALLRW, and EXISTSRW, concerned with Boolean reasoning,
stem from oSup, which was inspired by ↔Sup. Except for BOOLHOIST and
FALSEELIM, these rules have a condition stating that "if the head of $u$ is a variable, it
must be applied and the affected literal must be of the form $u \approx \top$, $u \approx \bot$, or $u \approx v$
where $v$ is a variable-headed term." The inferences at variable-headed terms permitted
by this condition are our form of primitive substitution [1,18], a mechanism that blindly
substitutes logical connectives and quantifiers for variables $z$ with a Boolean result type.


Example 1. Our calculus can prove that Leibniz equality implies equality (i.e., if two 
values behave the same for all predicates, they are equal) as follows: 


\[
\begin{array}{ll}
z\,a \approx \bot \lor z\,b \approx \top & \text{(premise)} \\
(x\,a \approx y\,a) \approx \bot \lor \bot \approx \top \lor x\,b \approx y\,b & \text{EQHOIST} \\
\top \approx \bot \lor \bot \approx \top \lor w\,a\,b\,b \approx w\,b\,a\,b & \text{BOOLRW} \\
\bot \approx \top \lor w\,a\,b\,b \approx w\,b\,a\,b & \text{FALSEELIM} \\
w\,a\,b\,b \approx w\,b\,a\,b & \text{FALSEELIM} \\
a \not\approx a & \text{SUP with } a \not\approx b \\
\bot & \text{ERES}
\end{array}
\]

The EQHOIST inference, applied on z b, illustrates how our calculus introduces logical
symbols without a dedicated primitive substitution rule. Although ≈ does not appear in
the premise, we still need to apply EQHOIST on z b with CSU(z b, x₀ ≈ y₀) =
{{z ↦ λv. x v ≈ y v, x₀ ↦ x b, y₀ ↦ y b}}. Other calculi [1,9,18,26] would apply an explicit
primitive substitution rule instead, yielding essentially (x a ≈ y a) ≈ ⊥ ∨ (x b ≈ y b) ≈ ⊤.
However, in our approach this clause is subsumed and could be discarded immediately.
By hoisting the equality to the clausal level, we bypass the redundancy criterion.

Next, BOOLRW can be applied to x a ≈ y a with CSU(x a ≈ y a, y₀ ≈ y₀) =
{{x ↦ λv. w a v v, y ↦ λv. w v a v, y₀ ↦ w a a a}}. The two FALSEELIM steps remove the ⊥ ≈ ⊤
literals. Then SUP is applicable with the unifier {w ↦ λx₁ x₂ x₃. x₂} ∈ CSU(b, w a b b),
and ERES derives the contradiction.
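
For comparison, the statement proved in Example 1 can also be checked in a proof assistant. The Lean 4 snippet below is only an illustrative sketch and is not part of the calculus; the names `α`, `a`, `b`, and `h` are chosen here. It instantiates the universally quantified predicate with `fun x => a = x`, which plays the same role as the equality that EQHOIST introduces at the variable-headed term z b.

```lean
-- Leibniz equality implies equality:
-- instantiate the predicate P with `fun x => a = x` and use reflexivity for `P a`.
example {α : Type} (a b : α) (h : ∀ P : α → Prop, P a → P b) : a = b :=
  h (fun x => a = x) rfl
```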


As in λSup, the FLUIDSUP rule is responsible for simulating superposition inferences
below applied variables, other fluid terms, and deeply occurring variables.
Complementarily, FLUIDBOOLHOIST and FLUIDLOOBHOIST simulate the various
Boolean inference rules below fluid terms. Initially, we considered adding a fluid version
of each rule that operates on Boolean subterms, but we discovered that
FLUIDBOOLHOIST and FLUIDLOOBHOIST suffice to achieve refutational completeness.


Example 2. The clause set consisting of h (y b) ≉ h (g ⊥) ∨ h (y a) ≉ h (g ⊤) and a ≉ b
highlights the need for FLUIDBOOLHOIST and its companion. The set is unsatisfiable
because the instantiation {y ↦ λx. g (x ≈ a)} produces the clause h (g (b ≈ a)) ≉
h (g ⊥) ∨ h (g (a ≈ a)) ≉ h (g ⊤), which is unsatisfiable in conjunction with a ≉ b.

The literal selection function can select either literal in the first clause. ERES is
applicable in either case, but the unifiers {y ↦ λx. g ⊥} and {y ↦ λx. g ⊤} do not lead
to a contradiction. Instead, we need to apply FLUIDBOOLHOIST if the first literal is
selected or FLUIDLOOBHOIST if the second literal is selected. In the first case, the
derivation is as follows:


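\[
\begin{array}{ll}
h\,(y\,b) \not\approx h\,(g\,\bot) \lor h\,(y\,a) \not\approx h\,(g\,\top) & \text{(premise)} \\
h\,(z'\,b\,\bot) \not\approx h\,(g\,\bot) \lor h\,(z'\,a\,(x'\,a)) \not\approx h\,(g\,\top) \lor x'\,b \approx \top & \text{FLUIDBOOLHOIST} \\
h\,(g\,(x'\,a)) \not\approx h\,(g\,\top) \lor x'\,b \approx \top & \text{ERES} \\
h\,(g\,(x''\,a \approx x'''\,a)) \not\approx h\,(g\,\top) \lor \bot \approx \top \lor x''\,b \approx x'''\,b & \text{EQHOIST} \\
h\,(g\,(a \approx x'''\,a)) \not\approx h\,(g\,\top) \lor \bot \approx \top \lor a \not\approx x'''\,b & \text{SUP with } a \not\approx b \\
h\,(g\,\top) \not\approx h\,(g\,\top) \lor \bot \approx \top \lor a \not\approx a & \text{BOOLRW} \\
\bot \approx \top \lor a \not\approx a & \text{ERES} \\
\bot \approx \top & \text{ERES} \\
\bot & \text{FALSEELIM}
\end{array}
\]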


The FLUIDBOOLHOIST inference uses the unifier {y ↦ λu. z′ u (x′ u), z ↦ λu. z′ b u,
x ↦ x′ b} ∈ CSU(z x, y b). We apply ERES to the first literal of the resulting clause, with
unifier {z′ ↦ λu v. g v} ∈ CSU(h (z′ b ⊥), h (g ⊥)). Next, we apply EQHOIST with the
unifier {x′ ↦ λu. x″ u ≈ x‴ u, w ↦ x″ b, w′ ↦ x‴ b} ∈ CSU(x′ b, w ≈ w′) to the literal
created by FLUIDBOOLHOIST, effectively performing a primitive substitution. The
resulting clause can superpose into a ≉ b with the unifier {x″ ↦ λu. u} ∈ CSU(x″ b, b).
The two sides of the interpreted equality in the first literal can then be unified, allowing
us to apply BOOLRW with the unifier {y ↦ a, x‴ ↦ λu. a} ∈ CSU(y ≈ y, a ≈ x‴ a).
Finally, applying ERES twice and FALSEELIM once yields the empty clause.
Remarkably, none of the provers that participated in the CASC-J10 competition can
solve this two-clause problem within a minute. Satallax finds a proof after 72 s and
LEO-II after over 7 minutes. Our new Zipperposition implementation solves it in 3 s.


The Redundancy Criterion. In first-order superposition, a clause is considered
redundant if all its ground instances are entailed by ≺-smaller ground instances of other
clauses. In essence, this will also be our definition, but we will use a different notion of
ground instances and a different notion of entailment.

Given a clause C, let its ground instances G(C) be the set of all clauses of the form
Cθ for some substitution θ such that Cθ is ground and Q≈-normal, and for all variables x
occurring in C, the only Boolean green subterms of xθ are ⊤ and ⊥. The rationale of this
definition is to ensure that ground instances of the conclusion of FORALLHOIST,
EXISTSHOIST, FORALLRW, and EXISTSRW inferences are smaller than the corresponding
instances of their premise by property (O4).

The redundancy criterion’s notion of entailment is defined via an encoding into a
weaker logic, following λfSup and λSup. In this paper, the weaker logic is ground
first-order logic with interpreted Booleans, that is, the ground fragment of the logic of oSup. Its
signature $(\Sigma_{\mathsf{ty}}, \Sigma_{\mathrm{GF}})$ is derived from our higher-order signature
$(\Sigma_{\mathsf{ty}}, \Sigma)$ as follows. The type constructors $\Sigma_{\mathsf{ty}}$ are the same in both
signatures, but → is an uninterpreted type constructor in first-order logic. For each ground instance
$f\langle\bar{\upsilon}\rangle : \tau_1 \to \cdots \to \tau_n \to \tau$ of a symbol $f \in \Sigma$, we introduce a
first-order symbol $f^{\bar{\upsilon}}_j \in \Sigma_{\mathrm{GF}}$ with argument types $\bar{\tau}_j$ and
result type $\tau_{j+1} \to \cdots \to \tau_n \to \tau$, for each j. Moreover, for each ground term λx. t, we
introduce a symbol $\mathsf{lam}_{\lambda x.\,t} \in \Sigma_{\mathrm{GF}}$ of the same type. The symbols
⊥₀, ⊤₀, ¬₁, ∧₂, ∨₂, →₂, $\approx^{\tau}_2$, and $\not\approx^{\tau}_2$ are identified with the
corresponding first-order logical symbols.

We define an encoding F of Q≈-normal ground higher-order terms into this ground
first-order logic recursively as follows: $F(\forall\langle\tau\rangle\,(\lambda x.\,t)) = \forall x.\,F(t)$ and
$F(\exists\langle\tau\rangle\,(\lambda x.\,t)) = \exists x.\,F(t)$ for applied quantifiers;
$F(\lambda x.\,t) = \mathsf{lam}_{\lambda x.\,t}$ for λ-expressions; and
$F(f\langle\bar{\upsilon}\rangle\,\bar{s}_j) = f^{\bar{\upsilon}}_j(F(\bar{s}_j))$ for other terms. For quantified
variables, we define F(x) = x. Here, Q≈-normality is crucial to ensure that bound variables do not
occur applied or within λ-expressions. The definition of green subterms is devised such that green
subterms correspond to first-order subterms via the encoding F, with the exception of first-order
subterms below quantifiers. The encoding F is extended to clauses by mapping each
literal and each side of a literal individually. From the entailment relation ⊨ for the
ground first-order logic, we derive an entailment relation $\models_F$ on Q≈-normal ground
higher-order clauses by defining $M \models_F N$ if $F(M) \models F(N)$. This relation is weaker
than standard higher-order entailment; for example, $\{f \approx g\} \not\models_F \{f\,a \approx g\,a\}$ (because
of the subscripts added by F) and $\{p\,(\lambda x.\,\top)\} \not\models_F \{p\,(\lambda x.\,\neg\bot)\}$ (because of
the lam symbols used by F).
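
The effect of the subscripts added by F can be illustrated with a small sketch. The Python below makes strong simplifying assumptions: terms are applied symbols only (no quantifiers, no λ-expressions, no type arguments), and the function name `encode` and the tuple representation are hypothetical choices made for this illustration, not part of Zipperposition.

```python
# Toy model of the encoding F on applied ground terms.
# A term is a pair (head, args): head is a symbol name, args is a tuple of terms.

def encode(term):
    """Map f s_1 ... s_j to the first-order term f_j(F(s_1), ..., F(s_j))."""
    head, args = term
    symbol = f"{head}_{len(args)}"  # the subscript records the number of arguments
    if not args:
        return symbol
    return f"{symbol}({', '.join(encode(a) for a in args)})"

f, g, a = ("f", ()), ("g", ()), ("a", ())
f_a, g_a = ("f", (a,)), ("g", (a,))

print(encode(f), "=", encode(g))      # f_0 = g_0
print(encode(f_a), "=", encode(g_a))  # f_1(a_0) = g_1(a_0)
```

Since f_0 and f_1 are unrelated first-order symbols in this model, the encoded equation f_0 ≈ g_0 says nothing about f_1(a_0) ≈ g_1(a_0).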

Using $\models_F$, we define a clause C to be redundant w.r.t. a clause set N if for every
D ∈ G(C), we have $\{E \in G(N) \mid E \prec D\} \models_F D$ or there exists a clause C′ ∈ N such that
C ⊐ C′ and D ∈ G(C′). The tiebreaker ⊐ can be an arbitrary well-founded partial order
on clauses; in practice, we use a well-founded restriction of the ill-founded strict
subsumption relation [6, Sect. 3.4]. We denote the set of redundant clauses w.r.t. a clause set
N by $\mathit{Red}_C(N)$. Note that $\models_F$ is weak enough to ensure that the ARGCONG inference
rule and axiom (EXT) are not immediately redundant and can fulfill their purpose.

For first-order superposition, an inference is considered redundant if for each of 
its ground instances, a premise is redundant or the conclusion is entailed by clauses 
smaller than the main premise. For most inference rules, our definition follows this idea, 
using $\models_F$ for entailment; other rules need nonstandard notions of ground instances and
redundancy. The definition of inference redundancy presented below is simpler than the 
more sophisticated notion in our technical report. Nonetheless, the redundant inferences 
below are a strict subset of the redundant inferences of our report and thus completeness 
also holds using the notion below. For the few prover optimizations based on inference 
redundancy that we know about (e.g., simultaneous superposition [4]), the following 
criterion suffices. 

For SUP, ERES, EFACT, BOOLHOIST, FALSEELIM, EQHOIST, NEQHOIST, and
BOOLRW, we define ground instances as usual: Ground instances are all inferences 
obtained by applying a grounding substitution to premises and conclusion such that the 
result adheres to the conditions of the given rule w.r.t. selection functions that select lit- 
erals and subterms as in the original premise. For FLUIDSUP and FLUIDBOOLHOIST, 
we define ground instances in the same way except that we require that ground in- 
stances adhere to the conditions of SUP or BOOLHOIST, respectively. For FORALLRW, 
EXISTSRW, FORALLHOIST, EXISTSHOIST, which do not have ground instances in the 
sense above, we define a ground instance as any inference that is obtained by applying 
the unifier σ to the premise and then applying a grounding substitution to premise and
conclusion, regardless of whether the resulting inference is an inference of our calculus. 

For all rules except FLUIDLOOBHOIST and ARGCONG, we define an inference to
be redundant w.r.t. a clause set N if for each ground instance ι, a premise of ι is
redundant w.r.t. G(N) or the conclusion of ι is entailed w.r.t. $\models_F$ by clauses from G(N)
that are smaller than the main (i.e., rightmost) premise of ι. For the rules
FLUIDLOOBHOIST and ARGCONG, as well as axioms (EXT) and (CHOICE), viewed as
premiseless inferences, we define an inference to be redundant w.r.t. a clause set N if all
ground instances of its conclusion are contained in G(N) or redundant w.r.t. G(N).
We denote the set of redundant inferences w.r.t. N by $\mathit{Red}_I(N)$.


Simplification Rules. Our redundancy criterion is strong enough to support counter- 
parts of most simplification rules implemented in Schulz’s first-order E [25, Sect. 2.3.1 
and 2.3.2]. Deletion of duplicated literals, deletion of resolved literals, syntactic tau- 
tology deletion, negative simplify-reflect, and clause subsumption adhere to our re- 
dundancy criterion. Positive simplify-reflect, equality subsumption, and rewriting (de- 
modulation) of positive and negative literals are supported if they are applied on green 
subterms or on other subterms that are encoded into first-order subterms by G and F. 
Semantic tautology deletion can be applied as well, using $\models_F$; moreover, for positive
literals, the rewriting clause must be smaller than the rewritten clause. 

Under some circumstances, inference rules can be applied as simplifications. The
FALSEELIM and BOOLRW rules can be applied as a simplification if σ is the identity.
If the head of u is ∀, FORALLHOIST and FORALLRW can both be applied and, together,
serve as one simplification rule. The same holds for EXISTSHOIST and EXISTSRW if
the head of u is ∃. For all of these rules, the eligibility conditions can be ignored.


Clausification. Like oSup, our calculus does not require the input problem to be clausi- 
fied during the preprocessing, and it supports higher-order analogues of the three inpro- 
cessing clausification methods introduced by Nummelin et al. Inner delayed clausi- 
fication relies on our core calculus rules to destruct logical symbols. Outer delayed 
clausification adds the following clausification rules to the calculus: 


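\[
\begin{array}{c} s \approx \top \lor C \\ \hline\hline \mathit{oc}(s, C) \end{array}\ \text{POSOUTERCLAUS}
\qquad
\begin{array}{c} s \approx \bot \lor C \\ \hline\hline \mathit{oc}(\neg s, C) \end{array}\ \text{NEGOUTERCLAUS}
\]
\[
\begin{array}{c} s \approx t \lor C \\ \hline s \approx \bot \lor t \approx \top \lor C \qquad s \approx \top \lor t \approx \bot \lor C \end{array}\ \text{EQOUTERCLAUS}
\]
\[
\begin{array}{c} s \not\approx t \lor C \\ \hline s \approx \bot \lor t \approx \bot \lor C \qquad s \approx \top \lor t \approx \top \lor C \end{array}\ \text{NEQOUTERCLAUS}
\]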

The double bars identify simplification rules (i.e., the conclusions make the premise
redundant and can replace it). The first two rules require that s has a logical symbol
as its head, whereas the last two require that s and t are Boolean terms other than ⊤
and ⊥. The function oc distributes the logical symbols over the clause C; for example,
oc(s → t, C) = {s ≈ ⊥ ∨ t ≈ ⊤ ∨ C}, and oc(¬(s ∨ t), C) = {s ≈ ⊥ ∨ C, t ≈ ⊥ ∨ C}. It is
easy to check that our redundancy criterion allows us to replace the premise of the
OUTERCLAUS rules with their conclusion. Nonetheless, we apply EQOUTERCLAUS
and NEQOUTERCLAUS as inferences because the premises might be useful in their
original form.
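
The behavior of oc can be sketched in a few lines of Python. Only the two examples given above (for → and for ¬(s ∨ t)) come from the text; the remaining cases follow standard clausification and are assumptions of this sketch, as are the tuple-based term representation and the helper name `oc_neg`.

```python
TOP, BOT = "⊤", "⊥"

def oc(s, clause):
    """One outer clausification step for the literal s ≈ ⊤ in `s ≈ ⊤ ∨ clause`.

    Terms are nested tuples such as ("imp", "s", "t"); atoms are strings.
    A literal is a pair (term, TOP or BOT); a clause is a tuple of literals.
    Returns the list of resulting clauses.
    """
    op = s[0] if isinstance(s, tuple) else None
    if op == "not":
        return oc_neg(s[1], clause)
    if op == "and":
        return [((s[1], TOP),) + clause, ((s[2], TOP),) + clause]
    if op == "or":
        return [((s[1], TOP), (s[2], TOP)) + clause]
    if op == "imp":
        return [((s[1], BOT), (s[2], TOP)) + clause]
    return [((s, TOP),) + clause]  # no logical head: nothing to distribute

def oc_neg(s, clause):
    """Same as oc, but for the literal s ≈ ⊥, i.e., oc(¬s, clause)."""
    op = s[0] if isinstance(s, tuple) else None
    if op == "not":
        return oc(s[1], clause)
    if op == "and":
        return [((s[1], BOT), (s[2], BOT)) + clause]
    if op == "or":
        return [((s[1], BOT),) + clause, ((s[2], BOT),) + clause]
    if op == "imp":
        return [((s[1], TOP),) + clause, ((s[2], BOT),) + clause]
    return [((s, BOT),) + clause]

C = (("c", TOP),)
print(oc(("imp", "s", "t"), C))          # one clause: s ≈ ⊥ ∨ t ≈ ⊤ ∨ C
print(oc(("not", ("or", "s", "t")), C))  # two clauses: s ≈ ⊥ ∨ C and t ≈ ⊥ ∨ C
```

The sketch strips only the outermost logical symbol (looking through negations), so the Boolean subterms s and t remain intact for later steps.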

Besides the two delayed clausification methods, a third inprocessing clausification 
method is immediate clausification. This clausifies the input problem’s outer Boolean 
structure in one swoop, resulting in a set of higher-order clauses. If unclausified Boolean 
terms rise to the top during saturation, the same algorithm is run to clausify them. 

Unlike delayed clausification, immediate clausification is a black box and is un- 
aware of the proof state other than the Boolean term it is applied to. Delayed clausifica- 
tion, on the other hand, clausifies the term step by step, allowing us to interleave clausifi- 
cation with the strong simplification machinery of superposition provers. It is especially 
powerful in higher-order contexts: Examples such as y p q ≉ (p ∨ q) can be refuted
directly by equality resolution, rather than via more explosive rules on the clausified form.


4 Refutational Completeness 


Our calculus is dynamically refutationally complete for problems in Q≈-normal form.
The full proof can be found in our technical report [8].


Theorem 3 (Dynamic refutational completeness). Let $(N_i)_i$ be a derivation, i.e.,
$N_i \setminus N_{i+1} \subseteq \mathit{Red}_C(N_{i+1})$ for all i. Let $N_0$ be Q≈-normal and such that
$N_0 \models \bot$. Moreover, assume that $(N_i)_i$ is fair, i.e., all inferences from clauses in the
limit inferior $\bigcup_i \bigcap_{j \ge i} N_j$ are contained in $\bigcup_i \mathit{Red}_I(N_i)$. Then we have
$\bot \in N_i$ for some i.

Following the completeness proof of λSup, our proof is structured in three levels of
logics. For each, we define a calculus and show that it is refutationally complete: ground
monomorphic first-order logic with an interpreted Boolean type (GF); the Q≈-normal
ground fragment of higher-order logic (GH); and higher-order logic (H).

The logic of the GF level is the ground fragment of oSup’s logic. The GF calculus 
is a ground version of oSup, which Nummelin et al. showed refutationally complete. It 
consists of ground first-order equivalents of our rules, excluding ARGCONG,
FLUIDBOOLHOIST, and FLUIDLOOBHOIST, which are specific to higher-order logic. The
counterparts to FORALLHOIST and EXISTSHOIST enumerate ground terms instead of 
producing free variables, to stay within the ground fragment. For compatibility with the 
nonground level, the conclusions of FORALLRW and EXISTSRW cannot contain con- 
crete Skolem functions. Instead, the GF calculus is parameterized by a witness function 
that can assign an arbitrary term to each occurrence of a quantifier in a clause. This wit- 
ness function is used to retrieve the Skolem terms in the GF equivalents of FORALLRW 
and EXISTSRW. 

On the next level, the GH calculus includes inference rules isomorphic to the GF
rules, transferred to higher-order logic via F⁻¹. Moreover, it contains an ARGCONG
variant that enumerates ground terms instead of introducing fresh variables, as well as
rules enumerating ground instances of axioms (EXT) and (CHOICE). We prove
refutational completeness of the GH calculus by constructing a higher-order interpretation
based on the model constructed for the completeness proof of the GF level. This proof
step is analogous to the corresponding step in λSup’s proof, but we must also consider
Q≈-normality and the logical symbols.

To lift completeness to the H level, we use the saturation framework of Waldmann et 
al. [31]. The main proof obligation it leaves us to show is that nonredundant GH infer- 
ences can be lifted to corresponding nonground H inferences. For this lifting, we must 
choose a suitable GH witness function and appropriate GH selection functions for liter- 
als and Boolean subterms, given a saturated clause set at the H level and the H selection 
functions. Then the saturation framework guarantees static refutational completeness 
w.r.t. Herbrand entailment, which is the entailment relation induced by the grounding 
function G. We then show that this implies dynamic refutational completeness w.r.t. ⊨
for Q≈-normal initial clause sets.


5 Implementation 


We implemented our calculus in the Zipperposition prover [14], whose OCaml source
code makes it convenient to prototype calculus extensions. Except for the presence
of axioms (EXT) and (CHOICE), the new code gracefully extends Zipperposition’s
implementation of oSup in the sense that oλSup coincides with oSup on first-order
problems. The same cannot be said w.r.t. λSup on Boolean-free problems because of
the FLUIDBOOLHOIST and FLUIDLOOBHOIST rules, which are triggered by any
applied variable. From the implementation of λSup, we inherit the given clause
procedure, which supports infinitely branching inferences, as well as calculus extensions and
heuristics [28]. From the implementation of oSup, we inherit the simplification rule
BOOLSIMP, a mainstay of our Boolean simplification machinery.

As in the implementation of λSup, we approximate fluid terms as terms that are
either nonground λ-expressions or terms of the form $x\,\bar{s}_n$ with n > 0. Two slight,
accidental discrepancies are that we also count variable occurrences below quantifiers as deep
and perform EFACT inferences even if the maximal literal is selected. Since we expect
FLUIDBOOLHOIST and FLUIDLOOBHOIST to be highly explosive, we penalize them
and all of their offspring. In addition to various λSup extensions [6, Sect. 5], we also
use all the rules for Boolean reasoning described by Vukmirović and Nummelin [30]
except for the BOOLEF rules.
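
The approximation of fluid terms described above can be written as a simple syntactic test. The following Python sketch is not Zipperposition’s OCaml code; the term classes (`Var`, `Const`, `Bound`, `App`, `Lam`) and the function names are hypothetical and serve only to make the two cases of the approximation concrete.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:    # free variable
    name: str

@dataclass(frozen=True)
class Const:  # constant symbol
    name: str

@dataclass(frozen=True)
class Bound:  # bound variable
    name: str

@dataclass(frozen=True)
class App:    # head applied to one or more arguments
    head: object
    args: tuple

@dataclass(frozen=True)
class Lam:    # λ-expression
    var: str
    body: object

def is_ground(t):
    """A term is ground if it contains no free variables."""
    if isinstance(t, Var):
        return False
    if isinstance(t, App):
        return is_ground(t.head) and all(is_ground(a) for a in t.args)
    if isinstance(t, Lam):
        return is_ground(t.body)
    return True  # Const or Bound

def is_fluid_approx(t):
    """Approximation: nonground λ-expressions or applied free variables."""
    if isinstance(t, Lam):
        return not is_ground(t)
    return isinstance(t, App) and isinstance(t.head, Var) and len(t.args) > 0

print(is_fluid_approx(App(Var("y"), (Const("b"),))))  # applied variable y b
print(is_fluid_approx(Lam("x", Bound("x"))))          # ground λ-expression
```

Under this test, the term y b from Example 2 counts as fluid, which is why FLUIDBOOLHOIST applies to it.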


6 Evaluation 


We evaluate the calculus implementation in Zipperposition and compare it with other 
higher-order provers. Our experiments were performed on StarExec Miami servers 
equipped with Intel Xeon E5-2620 v4 CPUs clocked at 2.10 GHz. We used all 2606 
TH0 theorems from the TPTP 7.3.0 library [27] and 1253 “Judgment Day” problems
[12] generated using Sledgehammer (SH) [24] as our benchmark set. An archive con- 
taining the benchmarks and the raw evaluation results is publicly available [5]. 


Calculus Evaluation. In this first part, we evaluate selected parameters of Zipperposi- 
tion by varying only the studied parameter in a fixed well-performing configuration. 
This base configuration disables axioms (CHOICE) and (EXT) and the FLUID- rules. It 
uses the unification procedure of Vukmirović et al. [29] in its complete variant, i.e.,
the variant that produces a complete set of unifiers. It uses none of the early Boolean
rules described by Vukmirović and Nummelin [30]. The preprocessor Q≈ is disabled
as well. All of the completeness-preserving simplification rules listed in Sect. 3 are en- 
abled. The configuration uses immediate clausification. We set the CPU time limit to 
30 s in all three experiments. 

In the first experiment, we assess the overhead incurred by the FLUID- rules. These 
rules unify with a term whose head is a fresh variable. Thus, we expected that they 
needed to be tightly controlled to achieve good performance. To test our hypothesis, 
we simultaneously modified the parameters of these three rules. In Figure 1, the off 
mode simply disables the rules, the pragmatic mode uses a terminating incomplete uni- 
fication algorithm (the pragmatic variant of Vukmirović et al. [29]), and the complete 
mode uses a complete unification algorithm. The results show that disabling FLUID- 
rules altogether achieves the best performance. However, on TPTP problems, complete 
finds 35 proofs not found by off, and pragmatic finds 22 proofs not found by off. On 
Sledgehammer benchmarks, this effect is much weaker, likely because the Sledgeham- 
mer benchmarks require less higher-order reasoning: complete finds only one new proof 
over off, and pragmatic finds only four. 

In the second experiment, we explore the clausification methods introduced at the 
end of Sect. 3: inner delayed clausification, outer delayed clausification, and immediate 
clausification. The modes inner and outer employ oSup’s RENAME rule, which renames 
Boolean terms headed by logical symbols using a Tseitin-like transformation if they 
occur at least four times in the proof state. Vukmirović and Nummelin [30] observed 
that outer clausification can greatly help prove higher-order problems, and we expected

         off    pragmatic    complete
TPTP    1642      1591         1619
SH       467       431          437
Fig. 1. Evaluation of FLUID- rules

         inner    outer    immediate
TPTP     1323     1670       1642
SH        406      470        467
Fig. 2. Evaluation of clausification method

         off    (four settings of the penalty p)
TPTP    1642    1617    1613    1615    1594
SH       467     458     458     459     445
Fig. 3. Evaluation of axiom (CHOICE)

                   TPTP    ofSH     SH
CVC4 1.8           1796     680    619
Leo-III 1.5.2      2104     681    621
Vampire 4.5        2131     692    681
Satallax 3.5       2162     573    587
Zip (CASC-J10)     2301     734    736
New Zip            2320     724    720
Fig. 4. Evaluation of all competitive higher-order provers



it to perform well for our calculus, too. The results are shown in Figure 2. The results 
confirm our hypothesis: The outer mode outperforms immediate on both TPTP and 
Sledgehammer benchmarks. The inner mode performs worst, but on Sledgehammer 
benchmarks, it proves 17 problems beyond the reach of the other two. Interestingly, 
several of these problems contain axioms of the form φ → ψ, and applying superposition
and demodulation to these axioms is preferable to clausifying them. 

In the third experiment, we investigate the effect of axiom (CHOICE), which is nec- 
essary to achieve refutational completeness. To evaluate (CHOICE), we either disabled it 
in a configuration labeled off or set the axiom’s penalty p to different values. In Zipper- 
position, penalties are propagated through inference and simplification rules and are 
used to increase the heuristic weight of clauses, postponing the selection of penalized 
clauses. The results are shown in Figure 3. As expected, disabling (CHOICE), or at least 
penalizing it heavily, improves performance. Yet enabling (CHOICE) can be crucial: For 
19 TPTP problems, the proofs are found when (CHOICE) is enabled and p = 4, but not 
when the rule is disabled. On Sledgehammer problems, this effect is weaker, with only 
two new problems proved for p = 4. 
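
The role of penalties in clause selection can be illustrated with a small priority queue. The sketch below is a guess at the general shape of such a mechanism, not Zipperposition’s implementation: the weight formula (size times penalty), the propagation rule (largest parent penalty times the rule’s own penalty), and all names are assumptions made for this example.

```python
import heapq
import itertools

class PassiveSet:
    """Passive clauses ordered by heuristic weight = size * penalty (assumed formula)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order

    def add(self, clause, size, penalty=1):
        weight = size * penalty
        heapq.heappush(self._heap, (weight, next(self._counter), clause, penalty))

    def pop(self):
        """Return the clause with the smallest heuristic weight and its penalty."""
        _, _, clause, penalty = heapq.heappop(self._heap)
        return clause, penalty

def child_penalty(parent_penalties, rule_penalty=1):
    """Assumed propagation: largest parent penalty times the rule's own penalty."""
    return max(parent_penalties, default=1) * rule_penalty

passive = PassiveSet()
passive.add("instance of (CHOICE)", size=3, penalty=4)
passive.add("ordinary clause", size=5)
print(passive.pop())  # the ordinary clause is selected first (weight 5 < 12)
```

With p = 4, the smaller but penalized clause is postponed behind a larger unpenalized one, and `child_penalty` passes the penalty on to its offspring.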


Prover Comparison. In this second part, we compare Zipperposition’s performance 
with other higher-order provers. Like at CASC-J10, the wall-clock time limit was 120 s, 
the CPU time limit was 960 s, and the provers were run on StarExec Miami. We used 
the following versions of all systems that took part in the THF division: CVC4 1.8 [3],
Leo-III 1.5.2 [26], Satallax 3.5 [13], and Vampire 4.5 [11]. The developers of Vampire
have informed us that its higher-order schedule is optimized for running on a single
core. As a result, the prover suffers some degradation of performance when running on
multiple cores. We evaluate both the version of Zipperposition that took part in
CASC-J10 (Zip) and the updated version of Zipperposition that supports our new calculus (New
Zip). Zip’s portfolio of prover configurations is based on λSup and techniques described
by Vukmirović and Nummelin [30]. New Zip’s portfolio is specially designed for our
new calculus and optimized for TPTP problems. To assess the performance of Boolean
reasoning, we used Sledgehammer benchmarks generated both with native Booleans
(SH) and with an encoding into Boolean-free higher-order logic (ofSH). For technical
reasons, the encoding also performs λ-lifting, but this minor transformation should have
little impact on results [6, Sect. 7].

The results are shown in Figure 4. The two versions of Zipperposition are ahead 
of all other provers on both benchmark sets. This shows that, with thorough parameter 
tuning, higher-order superposition outperforms tableaux, which had been the state of 
the art in higher-order reasoning for a decade. The updated version of New Zip beats 
Zip on TPTP problems but lags behind Zip on Sledgehammer benchmarks as we have 
yet to further explore more general heuristics that work well with our new calculus. The 
Sledgehammer benchmarks fail to demonstrate the superiority of native Boolean
reasoning compared with an encoding, and in fact CVC4 and Leo-III perform dramatically
better on the encoded Boolean problems, suggesting that there is room for tuning.


7 Conclusion 


We have created a superposition calculus for higher-order logic that is refutationally
complete. Most of the key ideas have been developed in previous work by us and
colleagues, but combining them in the right way has been challenging. A key idea was to
Q≈-normalize away inconvenient terms.

Unlike earlier refutationally complete calculi for full higher-order logic based on 
resolution or paramodulation, our calculus employs a term order, which restricts the 
proof search, and a redundancy criterion, which can be used to add various simplifica- 
tion rules while keeping refutational completeness. These two mechanisms are undoubt- 
edly major factors in the success of first-order superposition, and it is very fortunate that 
we could incorporate both in a higher-order calculus. An alternative calculus with the 
same two mechanisms could be achieved by combining oSup with Bhayat and Reger’s 
combinatory superposition [10]. The article on λSup [6, Sect. 8] discusses related work
in more detail. 

The evaluation results show that our calculus is an excellent basis for higher-order 
theorem proving. In future work, we want to experiment further with the different pa- 
rameters of the calculus (for example, with Boolean subterm selection heuristics) and 
implement it in a state-of-the-art prover such as E.

Abstract. Superposition is among the most successful calculi for first-
order logic. Its extension to higher-order logic introduces new challenges 
such as infinitely branching inference rules, new possibilities such as rea- 
soning about formulas, and the need to curb the explosion of specific 
higher-order rules. We describe techniques that address these issues and 
extensively evaluate their implementation in the Zipperposition theorem 
prover. Largely thanks to their use, Zipperposition won the higher-order 
division of the CASC-J10 competition. 


1 Introduction 


In recent decades, superposition-based first-order automatic theorem provers 
have emerged as useful reasoning tools. They dominate at the annual CASC [45] 
theorem prover competitions, having always won the first-order theorem divi- 
sion. They are also used as backends to proof assistants [13, 25, 35], automatic 
higher-order theorem provers [42], and software verifiers [17]. The superposition
calculus has only recently been extended to higher-order logic, resulting
in λ-superposition [6], which we developed together with Waldmann, as well as
combinatory superposition [10] by Bhayat and Reger.

Both higher-order superposition calculi were designed to gracefully extend 
first-order reasoning. As most steps in higher-order proofs tend to be essentially 
first-order, extending the most successful first-order calculus to higher-order logic 
seemed worth trying. Our first attempt at corroborating this conjecture was in 
2019: Zipperposition 1.5, based on λ-superposition, finished third in the higher-
order theorem division of CASC-27 [47], 12 percentage points behind the winner, 
the tableau prover Satallax 3.4 [11]. 

Studying the competition results, we discovered that higher-order tableaux 
have some advantages over higher-order superposition. To bridge the gap, we de- 
veloped techniques and heuristics that simulate the behavior of a tableau prover 
in the context of saturation. We implemented them in Zipperposition 2, which 
took part in CASC-J10 in 2020. This time, Zipperposition won the division,
solving 84% of problems, a whole 20 percentage points ahead of the next best
prover, Satallax 3.4. In this paper, we describe the main techniques that explain
this reversal of fortunes. They range from preprocessing to backend integration.

Interesting patterns can be observed in various higher-order encodings of 
problems. We show how we can exploit these to simplify problems (Sect. 3). By 
working on formulas rather than clauses, tableau techniques take a more holistic 
view of a higher-order problem. Delaying the clausification through the use of 
calculus rules that act on formulas achieves the same effect in superposition. We 
further explore the benefits of this approach (Sect. 4). 

The main drawback of λ-superposition compared with combinatory superposition
is that it relies on rules that enumerate possibly infinite sets of unifiers.
We describe a mechanism that interleaves performing infinitely branching in- 
ferences with the standard saturation process (Sect. 5). The prover retains the 
same behavior as before on first-order problems, smoothly scaling with increas- 
ing numbers of higher-order clauses. We also propose some heuristics to curb the 
explosion induced by highly prolific λ-superposition rules (Sect. 6).

Using first-order backends to finish the proof is common practice in higher- 
order reasoning. Since λ-superposition coincides with standard superposition on
first-order clauses, invoking backends may seem redundant; yet Zipperposition is
nowhere near as efficient as E [38] or Vampire [28], so invoking a more efficient backend
does make sense. We describe how to achieve a balance between allowing native 
higher-order reasoning and delegating reasoning to a backend (Sect. 7). 

Finally, we compare Zipperposition 2 with other provers on all monomorphic 
higher-order TPTP benchmarks [46] to perform a more extensive evaluation than 
at CASC (Sect. 8). Our evaluation corroborates the competition results. 


2 Background and Setting 


We focus on monomorphic higher-order logic, but the techniques can easily be ex- 
tended with polymorphism. Indeed, Zipperposition already supports some tech- 
niques polymorphically. 


Higher-Order Logic. We define terms s, t, u, v inductively as free variables
F, X, bound variables x, y, z, ..., constants f, g, a, b, ..., applications s t, and
λ-abstractions λx. s. The syntactic distinction between free and bound variables
gives rise to loose bound variables (e.g., y in λx. y a) [32]. We let $s\,\bar{t}_n$ stand for
$s\,t_1 \ldots t_n$ and $\lambda\bar{x}_n.\,s$ for $\lambda x_1. \ldots \lambda x_n.\,s$. Every β-normal term can be
written as $\lambda\bar{x}_m.\,s\,\bar{t}_n$, where s is not an application; we call s the head of the term.
If the type of a term t is of the form $\tau_1 \to \cdots \to \tau_n \to o$, where o is the
distinguished Boolean type and n > 0, we call t a predicate. A literal l is an
equation s ≈ t or a disequation s ≉ t. A clause is a finite multiset of literals,
interpreted and written disjunctively $l_1 \lor \cdots \lor l_n$. Logical symbols that may
occur within terms are written in boldface: ¬, ∧, ∨, →, ↔, .... Predicate literals
are encoded as (dis)equations with ⊤ based on their sign; for example, even(x)
becomes even(x) ≈ ⊤, and ¬even(x) becomes even(x) ≉ ⊤.

Higher-Order Calculi. The λ-superposition calculus is a refutationally complete
inference system and redundancy criterion for Boolean-free extensional
polymorphic clausal higher-order logic. The calculus relies on complete sets of
unifiers (CSUs). The CSU for s and t with respect to a set of variables V, denoted
by $\mathrm{CSU}_V(s, t)$, is a set of unifiers such that for any unifier ϱ of s and t, there exist
substitutions $\sigma \in \mathrm{CSU}_V(s, t)$ and θ such that ϱ(X) = σ(θ(X)) for all variables
X ∈ V. The set V is used to distinguish between important and auxiliary variables.
We usually omit it. A pragmatic, incomplete extension of λ-superposition
with interpreted Booleans is described by Vukmirović and Nummelin [51]. This
forms the basis of most of this work. Recently, a refutationally complete extension
was developed by Bentkamp et al. [5]; it is not considered here.

By contrast, the combinatory superposition calculus avoids CSUs by using a 
form of first-order unification, but essentially it enumerates higher-order terms 
using rules that instantiate applied variables with partially applied combinators 
from the complete combinator set {S,K,B,C,I}. This calculus is the basis of 
Vampire 4.5 [10], which finished closely behind Satallax 3.4 at CASC-J10. 

A different, very successful calculus is Satallax’s SAT-guided tableaux [2]. 
Satallax was the leading higher-order prover of the 2010s. Its simple and el- 
egant tableaux avoid deep superposition-style rewriting inferences. Neverthe- 
less, our working hypothesis for the past six years has been that superposition 
would likely provide a stronger basis for higher-order reasoning. Other competing 
higher-order calculi include SMT (implemented in CVC4 [3,4]) and extensional 
paramodulation (implemented in Leo-III [42]). 


Zipperposition. Zipperposition [6,12] is a higher-order theorem prover based 
on a pragmatic extension of λ-superposition. It was conceived as a testbed for
rapidly experimenting with extensions of first-order superposition, but over time, 
it has assimilated many of E’s techniques and heuristics. Zipperposition 2 also 
implements combinatory superposition. 

Several of our techniques extend the given clause procedure [30, Section 2.3], 
the standard saturation procedure. It partitions the proof state into a set P 
of passive clauses and a set A of active clauses. Initially, P contains all input 
clauses, and A is empty. At each iteration, a given clause C from P is moved to A 
(i.e., it is activated), all inferences between C and clauses in A are performed, and 
the conclusions are added to P. Because Zipperposition fully simplifies clauses 
only when they are activated, it implements a DISCOUNT-style loop [14]. 


Experimental Setup. To assess our techniques, we carried out experiments 
with Zipperposition 2. We used all 2606 monomorphic higher-order problems 
from the TPTP library [46], version 7.2.0, as benchmarks. Although some tech- 
niques support polymorphism, we uniformly used the monomorphic benchmarks. 
We fixed a base configuration of Zipperposition parameters as a baseline for all 
comparisons. Then, in each experiment, we varied the parameters associated with 
a specific technique to evaluate it. The experiments were run on StarExec [43] 
servers, equipped with Intel Xeon E5-2609 CPUs clocked at 2.40 GHz. Unless 
otherwise stated, we used a CPU time limit of 20 s, roughly the time each con-
figuration is given in the portfolio mode used for CASC. The raw evaluation
results are available online.


3 Preprocessing Higher-Order Problems 


The TPTP library contains thousands of higher-order problems. Despite their 
diversity, they have a markedly different flavor from the TPTP first-order prob- 
lems. Notably, they extensively use the definition role to identify universally 
quantified equations (or equivalences) that define symbols. 

Definitions can be replaced by rewrite rules, using the orientation given in 
the input problem. If there are multiple definitions for the same symbol, only the 
first one is replaced by a rewrite rule. Then, whenever a clause is picked in the 
given clause procedure, it will be rewritten using the collected rules. Since the 
TPTP format enforces no constraints on definitions, rewriting might diverge. To 
ensure termination, we limit the number of applied rewrite steps. In practice, 
most TPTP problems are well behaved: Only one definition is given for each 
symbol, and the definitions are acyclic. Instead of rewriting a clause when it is 
activated, we can rewrite the input formulas as a preprocessing step. This ensures 
that the input clauses will be fully simplified when the proving process starts 
and no defined symbols will occur in clauses, which usually helps the heuristics. 
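
As an illustration of definition unfolding with a bound on the number of rewrite steps, consider the following sketch. It makes simplifying assumptions that are ours, not the paper's: terms are first-order and represented as nested tuples (head, arg1, ...), strings stand for constants and parameters, and a definition is a triple (symbol, parameters, body). The names collect_rules, substitute, and rewrite are hypothetical.

```python
def collect_rules(definitions):
    """Keep only the first definition of each symbol as a rewrite rule."""
    rules = {}
    for symbol, params, body in definitions:
        rules.setdefault(symbol, (params, body))
    return rules


def substitute(term, env):
    """Replace parameter names in a term according to env."""
    if isinstance(term, str):
        return env.get(term, term)
    return tuple(substitute(t, env) for t in term)


def rewrite(term, rules, budget):
    """Unfold defined symbols innermost-first; return (term, unused budget)."""
    head, args = (term, ()) if isinstance(term, str) else (term[0], term[1:])
    new_args = []
    for arg in args:
        arg, budget = rewrite(arg, rules, budget)
        new_args.append(arg)
    if head in rules and budget > 0 and len(rules[head][0]) == len(new_args):
        params, body = rules[head]
        unfolded = substitute(body, dict(zip(params, new_args)))
        # Each unfolding consumes one unit of budget, which ensures termination.
        return rewrite(unfolded, rules, budget - 1)
    result = (head, *new_args) if new_args else head
    return result, budget
```

With a cyclic definition such as loop X ≈ loop X, rewrite stops once the budget is exhausted and returns the term reached so far.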

Eagerly unfolding the definitions and β-reducing can eliminate all of a prob-
lem’s higher-order features, making it amenable to first-order methods. How-
ever, this can inflate the problem beyond recognition and compromise the refu-
tational completeness of superposition.

To keep completeness, we can try to orient the definitions using the term order
that parameterizes superposition and rely on demodulation to simplify the proof
state. Usually, the Knuth–Bendix order (KBO) [26] is used. It compares terms by
first comparing their weights, where the weight of a term is the sum of the weights
assigned to the symbols it contains. Given a symbol weight assignment W, we can
update it so that it orients acyclic definitions from left to right assuming that they
are of the form f X̄m ≈ λȲn. t, where the only free variables in t are X̄m, no free
variable repeats or appears applied in t, and f does not occur in t. Then we traverse the
symbols f that are defined by such equations following the dependency relation, 
starting with a symbol f that does not depend on any other defined symbol. For 
each f, we set W(f) to w+1, where w is the maximum weight of the right-hand 
sides of f’s definitions, computed using W. By construction, for each equation 
the left-hand side is heavier. Thus, the equations are orientable from left to right. 
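
The weight adjustment can be sketched as follows. This is a simplified illustration under our own assumptions: a right-hand side is abstracted to the list of its symbol occurrences, any symbol without an assigned weight is treated as a variable of weight var_weight, and the dependency order is obtained with the standard library's topological sorter. The names term_weight and adjust_weights are hypothetical.

```python
from graphlib import TopologicalSorter


def term_weight(symbols, weights, var_weight=1):
    """Weight of a term given as the list of its symbol occurrences."""
    return sum(weights.get(s, var_weight) for s in symbols)


def adjust_weights(weights, definitions, var_weight=1):
    """definitions maps each defined symbol f to its right-hand sides."""
    weights = dict(weights)
    # f depends on every other defined symbol occurring in its right-hand sides.
    deps = {
        f: {s for rhs in rhss for s in rhs if s in definitions and s != f}
        for f, rhss in definitions.items()
    }
    # static_order yields dependencies first; it raises CycleError on cycles.
    for f in TopologicalSorter(deps).static_order():
        w = max(term_weight(rhs, weights, var_weight) for rhs in definitions[f])
        weights[f] = w + 1
    return weights
```

For the definitions double X ≈ plus X X and quad X ≈ double (double X) with all initial weights equal to 1, this yields W(double) = 4 and W(quad) = 10, so each left-hand side is heavier than the corresponding right-hand side.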


Evaluation and Discussion. The base configuration treats axioms anno- 
tated with definition as rewrite rules, and it preprocesses the formulas us- 
ing the rewrite rules. We also tested the effects of disabling this preprocessing 
(−preprocess), disabling the special treatment of definition axioms (−RW),
and disabling the special treatment of definition while using adjusted KBO
weights as described above (−RW+KBO). The results are given in Figure 1. In
all of the figures in this paper, each cell gives the number of proved problems;
the highest number is typeset in bold. Clearly, treating definition axioms as
rewrite rules greatly improves performance. Using adjusted KBO weights is not
as strong, although it proves 15 problems not proved using other configurations.


    base    −preprocess    −RW     −RW+KBO
    1638    1627           1303    1324

Fig. 1: Effect of the definition rewriting


           +LA     −LA
    IC     1624    1638
    DCI    1496    1531
    DCS    1659    1710

Fig. 2: Effect of clausification methods and lightweight AVATAR


4 Reasoning about Formulas 


Higher-order logic identifies terms and formulas. To prove a problem, we often 
need to instantiate a variable with the right predicate. Finding this predicate can 
be easier if the problem is not clausified. Consider the conjecture ∃f. f p q ↔ p ∧ q.
Expressed in this form, the formula is easy to prove by taking f := λx y. x ∧ y.
By contrast, guessing the right instantiation for the negated, clausified form
F p q ≉ ⊤ ∨ p ≉ ⊤ ∨ q ≉ ⊤, F p q ≈ ⊤ ∨ p ≈ ⊤, F p q ≈ ⊤ ∨ q ≈ ⊤ is more
challenging. One of the strengths of higher-order tableau provers is that they do 
not clausify the input problem. This might explain Satallax’s dominance in the 
THF division of CASC competitions until CASC-J10. 

We studied techniques to incrementally clausify formulas during proof search 
in incomplete [51] and complete [5] extensions of λ-superposition. Both ap-
proaches include the same set of (outer) delayed clausification rules that clausify
top-level logical symbols, proceeding outside in; for example, a clause C′ ∨
(p ∧ q) ≉ ⊤ is transformed into C′ ∨ p ≉ ⊤ ∨ q ≉ ⊤. The complete approach
requires additional inference rules; it also supports inner delayed clausification. 
We focus on the pragmatic, incomplete approach and do not consider inner 
clausification due to its poor performance [5]. 

Delayed clausification rules can be used as inference rules (which add con- 
clusions to the passive set) or as simplification rules (which delete premises and 
add conclusions to the passive set). Inferences are more flexible because they 
produce all intermediate clausification states, whereas simplifications produce 
fewer clauses. Since clausifying equivalences can destroy a lot of syntactic struc- 
ture [18], we never apply simplifying clausification rules on them. 
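
To make the outer rules concrete, here is a small sketch of delayed clausification used as simplification. It is an illustration under our own encoding, not the prover's implementation: a clause is a list of pairs (formula, positive), where positive = True stands for φ ≈ ⊤ and False for φ ≉ ⊤, formulas are strings (atoms) or tuples headed by "not", "and", or "or", and any other head (such as "iff") is left untouched. The names clausify_step and clausify are hypothetical.

```python
def clausify_step(clause):
    """Apply one outer rule to the first clausifiable literal, if any."""
    for i, (phi, positive) in enumerate(clause):
        if isinstance(phi, str) or phi[0] not in ("not", "and", "or"):
            continue  # atoms and equivalences are left untouched
        rest = clause[:i] + clause[i + 1:]
        if phi[0] == "not":
            return [rest + [(phi[1], not positive)]]
        _, left, right = phi
        if (phi[0] == "and") == positive:
            # Positive conjunction or negative disjunction: two clauses.
            return [rest + [(left, positive)], rest + [(right, positive)]]
        # Negative conjunction or positive disjunction: one longer clause.
        return [rest + [(left, positive), (right, positive)]]
    return None


def clausify(clause):
    """Simplification mode: premises are replaced until no rule applies."""
    todo, done = [clause], []
    while todo:
        current = todo.pop()
        conclusions = clausify_step(current)
        if conclusions is None:
            done.append(current)
        else:
            todo.extend(conclusions)
    return done
```

Applied to C′ ∨ (p ∧ q) ≉ ⊤, clausify returns the single clause C′ ∨ p ≉ ⊤ ∨ q ≉ ⊤, matching the example above. An inference-mode variant would keep the premises instead of replacing them.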

We discuss two tableau-inspired approaches for reasoning about formulas. 
First, we study how clause-splitting techniques interfere with delayed clausifica- 
tion. Second, we discuss heuristic instantiation of quantifiers during saturation. 

Zipperposition supports a lightweight variant of AVATAR [49], an architec- 
ture that partitions the search space by splitting clauses into variable-disjoint 
subclauses. This variant of AVATAR is described by Ebner et al. [15]. Combin-
ing lightweight AVATAR and delayed clausification makes it possible to split a
clause (φ1 ∨ ··· ∨ φn) ≈ ⊤, where the φi’s are arbitrarily complex formulas that
share no free variables with each other, into clauses φi ≈ ⊤.

To finish the proof, it suffices to derive ⊥ under each assumption φi ≈ ⊤.
Since the split is performed at the formula level, this technique resembles tab- 
leaux, but it exploits the strengths of superposition, such as its powerful redun- 
dancy criterion and simplification machinery, to close the branches. 
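
The variable-disjointness condition behind such a split can be computed by grouping disjuncts that share free variables. The sketch below is a hypothetical helper, assuming each disjunct comes paired with the set of its free variables; it is not taken from Zipperposition or from AVATAR.

```python
def split_components(disjuncts):
    """Group (formula, free_vars) pairs into variable-disjoint components."""
    components = []  # pairs (variables, formulas); pairwise variable-disjoint
    for formula, free_vars in disjuncts:
        merged_vars, merged = set(free_vars), [formula]
        remaining = []
        for comp_vars, comp in components:
            if comp_vars & merged_vars:
                # Shares a variable with the new disjunct: merge.
                merged_vars |= comp_vars
                merged = comp + merged
            else:
                remaining.append((comp_vars, comp))
        components = remaining + [(merged_vars, merged)]
    return [comp for _, comp in components]
```

A clause can be split only if this returns more than one component; ground disjuncts always form components of their own.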

Interleaving clausification and saturation allows us to simulate another tab-
leau technique. Whenever dynamic clausification replaces the predicate variable
x in a clause of the form (∀x. φ) ≈ ⊤ ∨ C with a fresh variable X, resulting in
φ{x ↦ X} ≈ ⊤ ∨ C, we can create additional clauses in which x is replaced
with t ∈ Inst, where Inst is a set of heuristically chosen terms. This set contains
λ-abstractions whose bodies are formulas and which occur in activated clauses,
and primitive instantiations [51]—that is, imitations (in the sense of higher-order 
unification) of logical symbols that approximate the shape of a predicate that 
can instantiate a predicate variable. 

However, as a new term t can be added to Inst after a clause with a quantified 
variable of the same type as t has been activated, we must also keep track of the 
clauses φ{x ↦ X} ≈ ⊤ ∨ C, so that when Inst is extended, we instantiate the
saved clauses. Conveniently, instantiated clauses are not recognized as subsumed, 
since Zipperposition uses an optimized but incomplete subsumption algorithm. 
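
The bookkeeping described above can be sketched with two tables indexed by type. This is a hypothetical data structure for illustration only: types and terms are plain strings, and each saved clause is represented by a callback that builds the instantiated clause from a term. The class name InstStore and its methods are our own.

```python
from collections import defaultdict


class InstStore:
    """Tracks Inst together with the clauses waiting for new instantiations."""

    def __init__(self):
        self.inst = defaultdict(list)   # type -> terms in Inst
        self.saved = defaultdict(list)  # type -> instantiation callbacks

    def register_clause(self, var_type, instantiate):
        """Save a quantified clause; instantiate it with the current Inst."""
        self.saved[var_type].append(instantiate)
        return [instantiate(t) for t in self.inst[var_type]]

    def add_term(self, term_type, term):
        """Extend Inst; instantiate all saved clauses of the same type."""
        if term in self.inst[term_type]:
            return []
        self.inst[term_type].append(term)
        return [instantiate(term) for instantiate in self.saved[term_type]]
```

Both methods return the newly created clauses, which the prover would then add to the passive set.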

Given a disequation f s̄n ≉ f t̄n, the abstraction of si is λx. u ≈ v, where u is
obtained by replacing si with x in f s̄n and v is obtained by replacing si with x
in f t̄n. For f s̄n ≈ f t̄n, the analogous abstraction is λx. ¬ (u ≈ v).
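
The construction can be illustrated on a simple term representation. In this sketch, which is ours and not the prover's code, terms are nested tuples (head, arg1, ...), the abstraction is encoded as ("lam", x, body), and the helper names replace and abstract_literal are hypothetical.

```python
def replace(term, target, var):
    """Replace every occurrence of the subterm target by var."""
    if term == target:
        return var
    if isinstance(term, tuple):
        return tuple(replace(t, target, var) for t in term)
    return term


def abstract_literal(lhs, rhs, positive, target, var="x"):
    """Abstraction of target in lhs ≈ rhs (positive) or lhs ≉ rhs."""
    body = ("eq", replace(lhs, target, var), replace(rhs, target, var))
    if positive:
        body = ("not", body)  # the polarity is inverted in the abstraction
    return ("lam", var, body)
```

For the disequation ap xs (ap ys zs) ≉ ap (ap xs ys) zs and target xs, this yields the encoding of λx. ap x (ap ys zs) ≈ ap (ap x ys) zs.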

Adding abstractions of the conjecture literals to Inst can provide useful in-
stantiations for formulas such as induction principles for datatypes. As the con-
jecture is negated, the equation’s polarity is inverted in the abstraction. Con-
sider the TPTP problem DAT056^2 [44], whose clausified negated conjecture
is ap xs (ap ys zs) ≉ ap (ap xs ys) zs, where ap is the append operator defined re-
cursively on its first argument and xs, ys, and zs are of list type. Abstracting xs
from the disequation yields t = λx. ap x (ap ys zs) ≈ ap (ap x ys) zs, which is added
to Inst. Included in the problem is the induction axiom for the list datatype:
∀p. (p nil ∧ (∀x xs. p xs → p (cons x xs))) → ∀xs. p xs, where nil and cons have the
usual meanings. Instantiating p with t and using the ap definition, we can prove
∀x. ap x (ap ys zs) ≈ ap (ap x ys) zs, from which we easily derive a contradiction.


Evaluation and Discussion. The base configuration uses immediate clausi- 
fication (IC), an approach that applies a standard clausification algorithm [33] 
both as a preprocessing step and whenever predicate variables are instantiated. 
Zipperposition’s lightweight AVATAR is disabled in the base configuration. To 
test the merits of delayed clausification, we vary base’s parameters along two 
axes: We choose immediate clausification (IC), delayed clausification as inference 
(DCI), or delayed clausification as simplification (DCS), and we either enable 
(+LA) or disable (−LA) the lightweight AVATAR. The base configuration does
not use instantiation with terms from Inst. 
Figure 2 shows that using delayed clausification as simplification greatly in-
creases the success rate, while using delayed clausification as inference has the
opposite effect. Manually inspecting the proofs found by the DCS configuration,
we noticed that a main reason for its success is that it does not simplify away
equivalences. Overall, the lightweight AVATAR harms performance, but the sets
of problems proved with and without it are vastly different. For example, the
IC+LA configuration proves 60 problems not proved by IC−LA.

The Boolean instantiation technique presented above requires delayed clausi- 
fication. To test its effects, we enabled it in the best configuration from Figure 2, 
DCS−LA. With this change, Zipperposition proves 1744 problems, 36 of which
cannot be proved by any other configuration in the same figure. Boolean instanti-
ation is the only way in which Zipperposition 2 can prove higher-order problems
requiring reasoning about induction axioms (e.g., DAT056^2).


5 Enumerating Infinitely Branching Inferences 


As an optimization and to simplify the implementation, Leo-III [40] and Vampire 
4.4 [9] (which uses a predecessor of combinatory superposition) compute only a 
finite subset of the possible conclusions for inferences that require enumerating 
a CSU. Not only is this a source of incompleteness, but choosing the cardinality 
of the computed subset is a difficult heuristic choice. Small sets can result in 
missing the unifier necessary for the proof, whereas large sets make the prover 
spend a long time in the unification procedure, generate useless clauses, and 
possibly get sidetracked into the wrong parts of the search space. 

We propose a modification to the given clause procedure to seamlessly inter- 
leave unifier computation and proof state exploration. Given a complete unifi- 
cation procedure, which may yield infinite streams of unifiers, our modification 
fairly enumerates all conclusions of inferences relying on elements of a CSU. 
Under some reasonable assumptions, it behaves exactly like the standard given 
clause procedure on purely first-order problems. We also describe heuristics that 
help achieve a similar performance as when using incomplete, terminating uni- 
fication procedures without sacrificing completeness. 

Given the undecidability of the question as to whether there exists a next 
CSU element in a stream of unifiers, the request for the next conclusion might 
not terminate, effectively bringing the theorem prover to a halt. Our modified 
given clause procedure expects the unification procedure to return a lazily com-
puted stream [34, Sect. 4.2], each element of which is either ∅ or a singleton set
containing a unifier. To avoid getting stuck waiting for a unifier that may not
exist, the unification procedure should return ∅ after it performs a number of
operations without finding a unifier. 
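
Such a stream maps naturally onto a generator. The following sketch assumes, for illustration, that the underlying search is exposed as an iterator that yields one item per elementary operation: either a unifier or None when the operation found nothing. The name interspersed and the parameter fuel are hypothetical.

```python
def interspersed(steps, fuel):
    """Turn a step-by-step search into a stream of ∅ or singleton sets."""
    spent = 0
    for outcome in steps:
        if outcome is not None:
            spent = 0
            yield {outcome}  # singleton set containing a unifier
        else:
            spent += 1
            if spent == fuel:
                spent = 0
                yield set()  # ∅: give control back to the prover
```

Because the generator is lazy, any finite prefix of the stream is computed in finite time as long as each elementary operation terminates, even when no further unifier exists.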

The complete unification procedure by Vukmirović et al. [52] returns such a
stream. Other procedures such as Huet’s [22] and Jensen and Pietrzykowski’s [23]
can easily be adapted to meet this requirement. Based on the stream of unifiers
interspersed with ∅, we can construct a stream of inferences similarly interspersed
with ∅, of which any finite prefix can be computed in finite time.
To support such streams in the given clause procedure, we extend it to rep-
resent the proof state not only by the active (A) and passive (P) clause sets, but
also by a priority queue Q containing the inference streams. Each stream is asso- 
ciated with a weight, and Q is sorted in order of increasing weight. Elsewhere [6], 
Bentkamp et al. described an older version of this extension. Here we present 
a newer version in more detail, including heuristics to postpone unpromising 
streams. The pseudocode of the modified procedure is as follows: 


function EXTRACTCLAUSE(Q, stream)
    maybe_clause ← pop and compute the first element of stream
    if stream is not empty then add stream to Q with an increased weight
    return maybe_clause


function HEURISTICPROBE(Q)
    (collected_clauses, i) ← (∅, 0)
    while i < Kbest and Q is not empty do
        (maybe_clause, j) ← (∅, 0)
        while j < Kretry and Q is not empty and maybe_clause = ∅ do
            stream ← pop the lowest weight stream in Q
            maybe_clause ← EXTRACTCLAUSE(Q, stream)
            j ← j + 1
        collected_clauses ← collected_clauses ∪ maybe_clause
        i ← i + 1
    return collected_clauses


function FAIRPROBE(Q, num_oldest)
    collected_clauses ← ∅
    oldest_streams ← pop num_oldest oldest streams from Q
    for stream in oldest_streams do
        collected_clauses ← collected_clauses ∪ EXTRACTCLAUSE(Q, stream)
    return collected_clauses


function FORCEPROBE(Q)
    collected_clauses ← ∅
    while Q is not empty and collected_clauses = ∅ do
        collected_clauses ← FAIRPROBE(Q, |Q|)
    if Q and collected_clauses are empty then status ← Satisfiable
    else status ← Unknown
    return (status, collected_clauses)


function GIVENCLAUSE(P, A, Q)
    (status, i) ← (Unknown, 0)
    while status = Unknown do
        if P is not empty then
            given ← pop a chosen clause from P and simplify it
            if given is the empty clause then status ← Unsatisfiable
            else
                A ← A ∪ {given}
                for stream in streams of inferences between given and other ∈ A do
                    if stream is not empty then P ← P ∪ EXTRACTCLAUSE(Q, stream)
                i ← i + 1
                if i mod Kfair = 0 then P ← P ∪ FAIRPROBE(Q, ⌊i/Kfair⌋)
                else P ← P ∪ HEURISTICPROBE(Q)
        else
            (status, forced_clauses) ← FORCEPROBE(Q)
            P ← P ∪ forced_clauses
    return status


Initially, all input clauses are put into P, and A and Q are empty. Unlike in 
the standard given clause procedure, inference results are represented as clause 
streams. The first element is inserted into P, and the rest of the stream is stored 
in Q with some positive integer weight computed from the inference rule. 

To eventually consider inference conclusions from streams in Q as given 
clauses, we extract elements from, or probe, streams and move any obtained 
clauses to P. Analogously to the traditional pick-given ratio [30,37], we use a
parameter Kfair (by default, Kfair = 70) to ensure fairness: Every Kfair-th itera-
tion, FAIRPROBE probes an increasing number of oldest streams, which achieves
dovetailing. In all other iterations, HEURISTICPROBE attempts to extract up to
Kbest clauses from the most promising streams (by default, Kbest = 7). In each
attempt, the most promising stream in Q is chosen. If its first element is ∅, the
rest of the stream is inserted into Q, and a new stream is chosen. This is repeated
until either Kretry occurrences of ∅ have been met (by default, Kretry = 20) or
the stream yields a singleton set. Setting Kretry > 0 increases the chance that
HEURISTICPROBE will return Kbest clauses, as desired. Finally, if P becomes
empty, FORCEPROBE searches relentlessly for a clause in Q, as a fallback. 

The function EXTRACTCLAUSE extracts an element from a nonempty stream 
not in Q and inserts the remaining stream into Q with an increased weight, calcu- 
lated as follows. Let n be the number of times the stream was chosen for probing. 
If probing results in ∅, the stream’s weight is increased by max {2, n − 16}. If
probing results in a clause C whose penalty is p, the stream’s weight is increased
by p · max {1, n − 64}. The penalty of a clause is a number assigned by Zip-
perposition based on features such as the depth of its derivation and the rules
used in it. The constants 16 and 64 increase the chance that newer streams are 
picked, which is desirable because their first clauses are expected to be useful. 
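
The probing functions and the weight update can be put together in a short executable sketch. It rests on several assumptions of ours: Q is a plain list, a stream wraps a Python iterator that yields None for ∅ or a clause, the penalty function defaults to the constant 1, and a stream is recognized as empty only when a probe finds it exhausted (so a stream may be reinserted once after its last element). The names Stream, extract_clause, heuristic_probe, and fair_probe are hypothetical counterparts of the pseudocode, not Zipperposition's API.

```python
import itertools

K_BEST, K_RETRY = 7, 20
_age = itertools.count()


class Stream:
    def __init__(self, elements, weight):
        self.elements = iter(elements)  # yields None (for ∅) or a clause
        self.weight = weight
        self.age = next(_age)           # smaller means older
        self.probes = 0                 # n: times chosen for probing


def extract_clause(queue, stream, penalty=lambda clause: 1):
    try:
        maybe_clause = next(stream.elements)
    except StopIteration:
        return None  # exhausted stream: dropped instead of reinserted
    stream.probes += 1
    n = stream.probes
    if maybe_clause is None:
        stream.weight += max(2, n - 16)
    else:
        stream.weight += penalty(maybe_clause) * max(1, n - 64)
    queue.append(stream)
    return maybe_clause


def heuristic_probe(queue):
    collected, i = [], 0
    while i < K_BEST and queue:
        maybe_clause, j = None, 0
        while j < K_RETRY and queue and maybe_clause is None:
            stream = min(queue, key=lambda s: s.weight)
            queue.remove(stream)
            maybe_clause = extract_clause(queue, stream)
            j += 1
        if maybe_clause is not None:
            collected.append(maybe_clause)
        i += 1
    return collected


def fair_probe(queue, num_oldest):
    oldest = sorted(queue, key=lambda s: s.age)[:num_oldest]
    for stream in oldest:
        queue.remove(stream)
    collected = []
    for stream in oldest:
        maybe_clause = extract_clause(queue, stream)
        if maybe_clause is not None:
            collected.append(maybe_clause)
    return collected
```

A production implementation would presumably use a priority queue rather than linear scans; the list keeps the sketch short and makes both orderings (by weight and by age) easy to express.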

All three probing functions are invoked by GIVENCLAUSE, which forms the 
body of the saturation loop. It differs from the standard given clause procedure in 
three ways: First, the proof state includes Q in addition to P and A. Second, new 
inferences involving the given clause are added to Q instead of being performed 
immediately. Third, inferences in Q are periodically performed lazily to fill P. 

GIVENCLAUSE eagerly stores the first element of a new inference stream in 
P to imitate the standard given clause procedure. If the underlying unification 
procedure behaves like the standard first-order unification algorithm on higher-
order logic’s first-order fragment, our given clause procedure coincides with the
standard one. The unification procedure by Vukmirović et al. terminates on the
first-order and other fragments [32], and for problems outside these fragments,
it immediately returns ∅ to avoid computing complicated unifiers eagerly.


Evaluation and Discussion. When the unification procedure of Vukmirović
et al. was implemented in Zipperposition, it was observed that Zipperposition is
the only competing higher-order prover that proves all Church numeral problems
from the TPTP, never spending more than 5 seconds on a problem [52].

Consider the TPTP problem NUM800^1, which requires finding a function F
such that F c1 c2 ≈ c2 ∧ F c2 c3 ≈ c6, where cn abbreviates the Church numeral
for n, λs z. sⁿ(z). To prove it, it suffices to take F to be the multiplication
operator λx y s z. x (y s) z. However, this unifier is only one out of many available
for each occurrence of F. 
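
That the multiplication operator satisfies both equations can be checked by encoding Church numerals as curried functions. The sketch below is only a sanity check of the arithmetic: it compares numerals extensionally by applying them to the successor function and 0, which is an assumption of the encoding, and the names church, to_int, and mult are ours.

```python
def church(n):
    """Church numeral for n: λs z. s^n(z)."""
    def numeral(s):
        def apply(z):
            for _ in range(n):
                z = s(z)
            return z
        return apply
    return numeral


def to_int(c):
    """Read back a Church numeral by counting applications of successor."""
    return c(lambda k: k + 1)(0)


# The multiplication operator λx y s z. x (y s) z, in curried form.
mult = lambda x: lambda y: lambda s: lambda z: x(y(s))(z)
```

With these definitions, to_int(mult(church(1))(church(2))) evaluates to 2 and to_int(mult(church(2))(church(3))) evaluates to 6.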

In an independent evaluation setup on the same set of 2606 problems used 
in this paper, Vukmirović et al. compared a complete, nonterminating variant 
and a pragmatic, terminating variant of the unification procedure [52, Sect. 7]. 
The pragmatic variant was used directly: All the inference conclusions were put
immediately in P, bypassing Q. The complete variant, which relies on possibly
infinite streams and is much more prolific, proved only 15 problems fewer than
the most competitive pragmatic variant. Furthermore, it proved 19 problems not
proved by the pragmatic variant. This shows that our given clause procedure, 
with its heuristics, allows the prover to defer exploring less promising branches of 
the unification and uses the full power of a complete higher-order unifier search 
to solve unification problems that cannot be solved by a crippled procedure. 

Among the competing higher-order theorem provers, only Satallax uses in- 
finitely branching calculus rules. It maintains a queue of “commands” that con- 
tain instructions on how to create a successor state in the tableau. One command 
describes infinite enumeration of all closed terms of a given function type. Each 
execution of this command makes progress in the enumeration. Unlike evaluation 
of streams representing elements of CSU, each command execution is guaranteed 
to make progress in enumerating the next closed functional term, so there is no 
need to ever return ∅.


6 Controlling Prolific Rules 


To support higher-order features such as function extensionality and quantifica- 
tion over functions, many refutationally complete calculi employ highly prolific 
rules. For example, λ-superposition uses a rule FLUIDSUP [6] that very often
applies to two clauses if one of them contains a term of the form F s̄n, where
n > 0. We describe three mechanisms to keep rules like these under control.

First, we limit applicability of the prolific rules. In practice, it often suffices to 
apply prolific higher-order rules only to initial or shallow clauses—clauses with 
a shallow derivation depth. Thus, we added an option to forbid the application 
of a rule if the derivation depth of any premise exceeds a limit. 
Second, we penalize the streams of expensive inferences. The weight of each
stream is given an initial value based on characteristics of the inference premises
such as their derivation depth. For prolific rules such as FLUIDSUP, we increment
this value by a parameter Kincr. Weights for less prolific variants of this rule,
such as DUPSUP [6], are increased by a fraction of Kincr (e.g., ⌊Kincr/3⌋).

Third, we defer the selection of prolific clauses. To select the given clause, 
most saturating provers evaluate clauses according to some criteria and select 
the clause with the lowest evaluation. For this choice to be efficient, passive 
clauses are organized into a priority queue ordered by their evaluations. Like 
E, Zipperposition maintains multiple queues, ordered by different evaluations, 
that are visited in a round-robin fashion. It also uses E’s two-layer evaluation 
functions, a variant of which has recently been implemented in Vampire [19]. 
The two layers are clause priority and clause weight. Clauses with higher pri- 
ority are preferred, and the weight is used for tie-breaking. Intuitively, the first 
layer crudely separates clauses into priority classes, whereas the second one uses 
heuristic weights to prefer clauses within a priority class. To control the selec- 
tion of prolific clauses, we introduce new clause priority functions that take into 
account features specific to higher-order clauses. 
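
The interplay of multiple queues and two-layer evaluations can be sketched as follows. The sketch assumes, as a convention of ours, that a numerically lower priority value means a more preferred class; clauses are arbitrary Python objects, and the class name PassiveSet is hypothetical.

```python
import heapq
import itertools


class PassiveSet:
    """Several priority queues over the same clauses, visited round-robin."""

    def __init__(self, evaluations):
        # evaluations: list of (priority_fn, weight_fn) pairs, one per queue.
        self.evaluations = evaluations
        self.heaps = [[] for _ in evaluations]
        self.taken = set()
        self.turn = 0
        self.ticks = itertools.count()

    def add(self, clause):
        tick = next(self.ticks)  # unique id; also breaks remaining ties
        for heap, (prio, weight) in zip(self.heaps, self.evaluations):
            heapq.heappush(heap, (prio(clause), weight(clause), tick, clause))

    def pop(self):
        """Select the best clause of the next queue in round-robin order."""
        for _ in range(len(self.heaps)):
            heap = self.heaps[self.turn]
            self.turn = (self.turn + 1) % len(self.heaps)
            while heap:
                *_, tick, clause = heapq.heappop(heap)
                if tick not in self.taken:  # skip clauses picked elsewhere
                    self.taken.add(tick)
                    return clause
        return None
```

For example, pairing a constant priority with a PreferLambda-style priority gives one queue that selects purely by weight and another that first exhausts the clauses containing λ-abstractions.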

The first new priority function PreferHOSteps (PHOS) assigns a higher pri-
ority if rules specific to λ- or combinatory superposition were used in the clause
derivation. Since most of the other clause priority functions tend to defer higher-
order clauses, having a clause queue that prefers the results of higher-order in-
ferences might be necessary to find a proof more efficiently. A simpler function,
which prefers clauses containing λ-abstractions, is PreferLambda (PL).

We also introduce the priority function ByNormalizationFactor (BNF), in-
spired by the observation that a higher-order inference that applies a compli-
cated substitution to a clause is usually followed by a βη-normalization step. If
βη-normalization greatly reduces the size of a clause, it is likely that this sub-
stitution simplifies the clause (e.g., by removing a variable’s arguments). Thus,
this function prefers clauses that were produced by βη-normalization, and among
those it prefers the ones with larger size reductions.

Another new priority function is PreferShallowAppVars (PSAV). This prefers 
clauses with lower depths of the deepest occurrence of an applied variable—that 
is, C[X a] is preferred over C[f (X a)]. This function tries to curb the explosion 
of both λ- and combinatory superposition: Applying a substitution to a top-level
applied variable often reduces this applied variable to a term with a constant 
head, which likely results in a less explosive clause. Among the functions that rely 
on properties of applied variables we implemented PreferDeepAppVars (PDAV), 
which returns the priority opposite of PSAV, and ByAppVarNum (BAVN), which 
prefers clauses with fewer occurrences of applied variables. 
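
The measure underlying PSAV can be sketched on a simple term representation. Here we assume, for illustration, that terms are nested tuples (head, arg1, ...) with string heads and that variables are the names starting with an uppercase letter; the function name deepest_applied_var is hypothetical.

```python
def deepest_applied_var(term, depth=0):
    """Depth of the deepest applied variable in term, or -1 if there is none."""
    if isinstance(term, str):
        return -1  # a bare symbol or an unapplied variable
    head, args = term[0], term[1:]
    best = depth if head[0].isupper() else -1
    for arg in args:
        best = max(best, deepest_applied_var(arg, depth + 1))
    return best
```

A PSAV-style priority would prefer smaller values of this measure, so X a (depth 0) is preferred over f (X a) (depth 1); a PDAV-style priority would prefer larger values.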


Evaluation and Discussion. In the base configuration, Zipperposition visits 
several clause queues, one of which uses the constant priority function ConstPrio 
(CP). To evaluate the new priority functions, we replaced the queue ordered by 
CP with the queue ordered by one of the new functions, leaving the clause weight 
intact. The results are shown in Figure 3. It shows that the expensive priority
functions PHOS and BNF, which require inspecting the proof of clauses, hardly
help. Simple functions such as PL are more effective: Compared with base, PL
loses one problem overall but proves 22 new problems.


    base (CP)   BAVN   PL     PSAV   PHOS   BNF    PDAV
    1638        1640   1637   1637   1632   1594   1520

Fig. 3: Effect of the priority function on performance


    base (∞)   16     8      4      2      1
    1638       1619   1621   1618   1612   1610

Fig. 4: Effect of the FLUIDSUP weight increment Kincr on performance

FLUIDSUP is disabled in base because it is so explosive. To test if increas- 
ing inference stream weights makes a difference on the success rate, we enabled 
FLUIDSUP and used different weight increments Kincr for FLUIDSUP inference
queues. The results are shown in Figure 4. As expected, using a low incre-
ment with FLUIDSUP is detrimental to performance. However, as the column
for Kincr = 16 shows, nor should we use too high an increment, since that delays
useful FLUIDSUP inferences. Interestingly, even though the configuration with
Kincr = 1 proves the fewest problems overall, it proves 7 problems not proved by
base, which is more than any other configuration we tried. 


7 Controlling the Use of Backends 


Cooperation with efficient first-order theorem provers is an essential feature of 
higher-order theorem provers such as Leo-III [40, Sect. 4.4] and Satallax [11]. 
Those provers invoke first-order backends repeatedly during a proof attempt 
and spend a substantial amount of time in backend collaboration. Since λ-super-
position generalizes a highly efficient first-order calculus, we expect that future
efficient λ-superposition implementations will not benefit much from backends.
Experimental provers such as Zipperposition can still gain a lot. We present 
some techniques for controlling the use of backends. 

In his thesis [40, Sect. 6.1], Steen extensively evaluates the effects of using dif- 
ferent first-order backends on the performance of Leo-III. His results suggest that 
adding only one backend already substantially improves the performance. To re- 
duce the effort required for integrating multiple backends, we chose Ehoh [50] as 
our single backend. Ehoh is an extension of the highly optimized superposition 
prover E with support for higher-order features such as partial application, ap- 
plied variables, and interpreted Booleans. On the one hand, Ehoh provides the 
efficiency of E while easing the translation from full higher-order logic: The only 
missing syntactic feature is λ-abstraction. On the other hand, Ehoh’s higher-
order reasoning capabilities are limited. Its unification algorithm is essentially
first-order and it cannot synthesize λ-abstractions.


    base   0.1    0.25   0.5    0.75
    1638   1936   1935   1934   1923

Fig. 5: Effect of the backend invocation point Ktime


    base   lifting   SKBCI   omitted
    1638   1935      1867    1855

Fig. 6: Effect of the method used to translate λ-abstractions


    base   16     32     64     128    256    512
    1638   1936   1935   1939   1928   1925   1912

Fig. 7: Effect of the number of selected clauses Ksize

In a departure from Leo-III and other cooperative provers, we invoke the 
backend at most once during a run of the prover. This is because most competi- 
tive higher-order provers use a portfolio mode in which many configurations are 
run for a short time, and we want to leave enough time for native higher-order 
reasoning. Moreover, multiple backend invocations tend to be wasteful, because 
currently each invocation starts with no knowledge of the previous ones. 

Only a carefully chosen subset of the available clauses are translated and sent 
to Ehoh. Let I be the set of input clauses. Given a proof state, let M = P ∪ A,
and let Mho denote the subset of M that contains only clauses that were derived
using at least one λ-superposition-specific inference rule. We order the clauses in
Mho by increasing derivation depth, using syntactic weight to break ties. Then
we choose all clauses in I and the first Ksize clauses from Mho for use with the
backend reasoner. We leave out clauses in M \ (I ∪ Mho) because Ehoh can rederive
them. We also expect large clauses with deep derivations to be less useful. 
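
The selection just described amounts to a filter followed by a sort. In the sketch below, the Clause record and its fields (depth, weight, ho_derived) are hypothetical stand-ins for the information the prover tracks about each clause, and select_for_backend is our own name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    name: str
    depth: int        # derivation depth
    weight: int       # syntactic weight
    ho_derived: bool  # derived using a λ-superposition-specific rule


def select_for_backend(inputs, passive, active, k_size):
    """All input clauses plus the first k_size clauses of Mho."""
    m = list(passive) + list(active)
    m_ho = sorted(
        (c for c in m if c.ho_derived),
        key=lambda c: (c.depth, c.weight),  # depth first, weight breaks ties
    )
    return list(inputs) + m_ho[:k_size]
```

Clauses in M that are neither input clauses nor members of Mho are deliberately left out, since the backend can rederive them.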

The remaining step is the translation of λ-abstractions. We support two
translation methods: λ-lifting [24] and SKBCI combinators [48]. For SKBCI, we
omit the combinator definition axioms, because they are very explosive [10]. A
third mode simply omits clauses containing λ-abstractions.


Evaluation and Discussion. In Zipperposition, we can adjust the CPU time 
allotted to Ehoh, Ehoh’s own proof search parameters, the point when Ehoh is 
invoked, the number Ksize of selected clauses from Mho, and the λ translation
method. We fix the time limit to 5 s, use Ehoh in auto mode, and focus on the 
last three parameters. In base, collaboration with Ehoh is disabled. 

Ehoh is invoked after Ktime · t CPU seconds, where 0 < Ktime < 1 and t is the
total CPU time allotted to Zipperposition. Figure 5 shows the effect of varying
Ktime when Ksize = 32 and λ-lifting is used. The evaluation confirms that using
a highly optimized backend such as Ehoh greatly improves the performance of a 
less optimized prover such as Zipperposition. The figure indicates that it is prefer- 
able to invoke the backend early. We have indeed observed that if the backend 
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Uncoop Coop 


CVC4 1810 = 
Leo-III 1641 2108 
Satallax 2089 2224 
Vampire 2096 


Zipperposition 2223 2307 


Fig. 8: Comparison of competing higher-order theorem provers 


is invoked late, small clauses with deep derivations tend to be present by then. 
These clauses might have been used to delete important shallow clauses already. 
But due to their derivation depth, they will not be translated. In such situations, 
it is better to invoke the backend before the important clauses are deleted. 
Figure 6 quantifies the effects of the three λ-abstraction translation methods.
We fixed $K_{\text{time}} = 0.25$ and $K_{\text{size}} = 32$. The clear winner is λ-lifting. Omitting
clauses with λ-abstractions performs comparably to SKBCI combinators.
Figure 7 shows the effect of $K_{\text{size}}$ on performance, with $K_{\text{time}} = 0.25$ and
λ-lifting. We find that including a small number of higher-order clauses with the
lowest weight performs better than including a large number of such clauses.


8 Comparison with Other Provers 


Different choices of parameters lead to noticeably different sets of proved prob- 
lems. In an attempt to use Zipperposition 2 to its full potential, we have created 
a portfolio mode that runs up to 50 configurations in parallel during the allot- 
ted time. To provide some context, we compare Zipperposition 2 with the latest 
versions of all higher-order provers that competed at CASC-J10: CVC4 1.8 [4], 
Leo-III 1.5 [42], Satallax 3.5 [11], and Vampire 4.5 [10]. Note that Vampire’s
higher-order schedule is optimized for running on a single core. 

We use the same 2606 monomorphic higher-order TPTP 7.2.0 problems as 
elsewhere in this paper, but we try to replicate the CASC setup more faithfully. 
CASC-J10 was run on 8-core CPUs with a 120 s wall-clock limit and a 960 s 
CPU limit. Since we run the experiments on 4-core CPUs, we set the wall-clock 
limit to 240 s and keep the same CPU limit. Leo-III, Satallax, and Zipperposition
are cooperative provers. We also run them in uncooperative mode, without their 
backends, to measure their intrinsic strength. Figure 8 summarizes the results. 

Among the cooperative provers, Zipperposition is the one that depends the 
least on its backend, and its uncooperative mode is only one problem behind 
Satallax’s cooperative mode. This confirms our hypothesis that λ-superposition is
a suitable basis for automatic higher-order reasoning. This also suggests that the 
implementation of this calculus in a modern first-order superposition prover such 
as E or Vampire would achieve markedly better results. Moreover, we believe that 
there are still techniques inspired by tableaux, SAT solving, and SMT solving 
that could be adapted and integrated in saturation provers. 


9 Discussion and Conclusion


Back in 1994, Kohlhase [27, Sect. 1.3] was optimistic about the future of higher- 
order automated reasoning: 


The obstacles to proof search intrinsic to higher-order logic may well be 
compensated by the greater expressive power of higher-order logic and 
by the existence of shorter proofs. Thus higher-order automated theorem 
proving will be practically as feasible as first-order theorem proving is 
now as soon as the technological backlog is made up. 


For higher-order superposition, the backlog consisted of designing calculus ex- 
tensions, heuristics, and algorithms that mitigate its weaknesses. In this paper, 
we presented such enhancements, justified their design, and evaluated them. We 
explained how each weak point in the higher-order proving pipeline could be im- 
proved, from preprocessing to reasoning about formulas, to delaying unpromis- 
ing or explosive inferences, to invoking a backend. Our evaluation indicates that 
higher-order superposition is now the state of the art in higher-order reasoning. 

Higher-order extensions of first-order superposition have been considered by 
Bentkamp et al. [6,7] and Bhayat and Reger [9,10]. They introduced proof cal- 
culi, proved them refutationally complete, and suggested optional rules, but they 
hardly discussed the practical aspects of higher-order superposition. Extensions 
of SMT are discussed by Barbosa et al. [3]. Bachmair and Ganzinger [1], Manna 
and Waldinger [29], and Murray [31] have studied nonclausal resolution calculi. 

In contrast, there is a vast literature on practical aspects of first-order rea- 
soning using superposition and related calculi. The literature evaluates various 
procedures and techniques [21,36], literal and term order selection functions [20], 
and clause evaluation functions [19,39], among others. Our work joins the select 
club of papers devoted to practical aspects of higher-order reasoning [8,16,41,53]. 

As a next step, we plan to implement the described techniques in Ehoh [50],
the λ-free higher-order extension of E. We expect the resulting prover to be
substantially more efficient than Zipperposition. Moreover, we want to investigate
the proofs found by provers such as CVC4 and Satallax but missed by
Zipperposition. Finding out why Zipperposition fails to prove specific
problems will likely result in useful new techniques.


Abstract. Existing proof-generating quantified Boolean formula (QBF) solvers
must construct a different type of proof depending on whether the formula is 
false (refutation) or true (satisfaction). We show that a QBF solver based on or- 
dered binary decision diagrams (BDDs) can emit a single dual proof as it oper- 
ates, supporting either outcome. This form consists of a sequence of equivalence- 
preserving clause addition and deletion steps in an extended resolution frame- 
work. For a false formula, the proof terminates with the empty clause, indicating 
conflict. For a true one, it terminates with all clauses deleted, indicating tautology. 
Both the length of the proof and the time required to check it are proportional to 
the total number of BDD operations performed. We evaluate our solver using a 
scalable benchmark based on a two-player tiling game. 


1 Introduction 


Adding quantifiers to Boolean formulas, yielding the logic of quantified Boolean for- 
mulas (QBFs), greatly extends their expressive power [11], but it presents several chal- 
lenges, including verifying the output of a QBF solver. Unlike a satisfiable Boolean 
formula, there is no satisfying assignment for a QBF—the formula is simply false or 
true. Instead, a proof-generating QBF solver must provide a full proof in either case: a 
refutation proof if the formula is false, or a satisfaction proof if the formula is true. 

Currently, there is little standardization of the proof capabilities or the proof sys- 
tems supported by different QBF solvers [21]. Some solvers can generate syntactic 
certificates—ones that can be directly checked by a proof checker. For a false formula, 
these can be expressed in clausal proof frameworks that augment resolution with rules 
for universal quantification [18]. For a true formula, several QBF solvers can generate 
term resolution proofs [12], effectively reasoning about a negated version of the input 
formula represented in disjunctive form. These require the proof checker to support an 
entirely different set of proof rules. 

An even larger number of solvers can generate semantic certificates in the form of 
Herbrand functions for false formulas and Skolem functions for true ones, describing 
how to instantiate either the universal or the existential variables [21]. These can be 
used to expand the original formula into an (often much larger) Boolean formula that is
checked with a SAT solver [22] or with a high-degree polynomial algorithm [25]. Per- 
forming the check often requires far more effort than does running the solver. These ap- 
proaches, along with others involving syntactic certificates, require at least two passes— 
one to determine whether the formula is true or false and one to generate the proof. 

This paper describes a new approach to proof generation for QBF, where the solver
generates a dual proof, serving as either a refutation or a satisfaction proof depending 
on whether the solver determines the formula to be false or true. A dual proof consists 
of a sequence of clause addition and deletion steps, each preserving equivalence to the 
original formula. If the proof terminates with the addition of the empty clause, then it 
demonstrates that the original formula was contradictory and therefore false. If the proof 
terminates with all clauses removed, then it demonstrates that the original formula was 
equivalent to a tautology and is therefore true. The proofs are expressed in a clausal 
proof framework that incorporates extended resolution, as well as rules for universal 
and existential quantification [13, 14]. 

We have implemented a QBF solver PGBDDQ based on ordered binary decision di- 
agrams (BDDs) that can generate dual proofs as it operates. As optimizations, PGBDDQ 
can be directed to generate refutation or satisfaction proofs, and these can be somewhat 
shorter and take less time to check than dual proofs. Refutation proofs follow the tradi- 
tional format of a series of truth-preserving steps leading to an empty clause. Satisfac- 
tion proofs follow the novel format of a series of falsehood-preserving steps leading to 
an empty set of clauses. This approach for satisfaction proofs has been previously used 
as part of a QBF preprocessor [13, 14], but, to the best of our knowledge, ours is the 
first use in a complete QBF solver. Whether dual, refutation, or satisfaction, the proofs 
generated by PGBDDQ have length proportional to the number of BDD operations and 
can readily be validated by a simple proof checker. 

For the case of refutation proofs, PGBDDQ builds on the work of Jussila et al. [17],
whose BDD-based QBF solver EBDDRES could generate refutation proofs in an ex- 
tended resolution framework. Whereas their solver, as well as all other published BDD- 
based QBF solvers [23,24], require the BDD variable ordering to be the inverse of the 
quantification ordering, PGBDDQ allows independent choices for the two orderings. As 
will be shown, this can lead to an exponential advantage on some benchmarks. 

We evaluate the performance of PGBDDQ using a scalable benchmark based on a 
two-player tiling game. We show that, with the right combination of Tseitin variable 
placement, BDD variable ordering and elimination variable ordering, a BDD-based 
QBF solver can achieve performance that scales polynomially with the problem size. 
In these cases, PGBDDQ can readily outperform state-of-the-art search-based solvers, 
while having the added benefit that it generates a checkable proof. 


2 Background Preliminaries 


A literal $l$ is either a variable $y$ or its complement $\bar{y}$. We denote the underlying
variable for literal $l$ as $\mathit{Var}(l)$, while $\bar{l}$ denotes the complement of literal $l$.

A clause is a set of literals, representing the disjunction of a set of complemented
and uncomplemented variables. The empty clause, indicating logical falsehood, is
written $\bot$. We consider only proper clauses, where a literal can only occur once in a
clause, and a clause cannot contain both a variable and its complement. Logical truth, or
tautology, is denoted $\top$ and represented by an empty set of clauses. For clarity, we write
clauses as Boolean formulas, such as $x \wedge y \rightarrow z$ for the clause $\{\bar{x}, \bar{y}, z\}$. As a special
case, the unit clause consisting of literal $l$ is simply written as $l$.
ITE: For Boolean values $a$, $b$, and $c$, the ITE operation (short for “If-Then-Else”) is
defined as: $\mathit{ITE}(a, b, c) = (a \wedge b) \vee (\neg a \wedge c)$. This can also be written as a conjunction
of clauses: $\mathit{ITE}(a, b, c) = (a \rightarrow b) \wedge (\neg a \rightarrow c)$.
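
The agreement of the two forms of ITE can be checked by enumerating all eight assignments. The following is a small sanity check, not part of any solver; the function names are chosen here for illustration.

```python
from itertools import product


def ite(a, b, c):
    """Disjunctive form: (a AND b) OR (NOT a AND c)."""
    return (a and b) or ((not a) and c)


def ite_clausal(a, b, c):
    """Clausal form: (a -> b) AND (NOT a -> c)."""
    def implies(p, q):
        return (not p) or q
    return implies(a, b) and implies(not a, c)


# Compare the two forms on every assignment to (a, b, c).
ite_forms_agree = all(
    ite(a, b, c) == ite_clausal(a, b, c)
    for a, b, c in product([False, True], repeat=3)
)
```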
QBF: We consider quantified formulas in prenex normal form over a set of input
variables $X$, with input formula $\Phi_I$ having the form $\Phi_I = Q_1 X_1\, Q_2 X_2 \cdots Q_m X_m\, \psi_I$.
The quantifier prefix $\mathcal{Q}_I = Q_1 X_1\, Q_2 X_2 \cdots Q_m X_m$ consists of a series of quantifier
blocks. Each block $j$ has an associated quantifier $Q_j \in \{\forall, \exists\}$ and a set of variables
$X_j \subseteq X$, such that the sets $X_1, X_2, \ldots, X_m$ form a partitioning of $X$. The formula
matrix $\psi_I$ is given as a set of clauses referred to as the input clauses. An input variable
$x$ occurring in some partition $X_j$ is said to be universal (respectively, existential) when
$Q_j = \forall$ (resp., $Q_j = \exists$) and is said to be at quantification level $j$. The type and level of
each literal $l$ matches that of its underlying variable $\mathit{Var}(l)$.
Resolution: Let $C$ and $D$ be clauses, where $C$ contains variable $y$ and $D$ contains its
complement $\bar{y}$. We also require that there be no literal $l \in C$, with $l \neq y$, such
that $\bar{l} \in D$. The resolvent clause is then defined as $\mathit{Res}(C, D) = C \cup D - \{y, \bar{y}\}$.
When $C$ and $D$ do not satisfy the above requirements, $\mathit{Res}(C, D)$ is undefined.
This definition does not allow the resolvent to be a tautology.

The resolution operation extends to linear chains and sets of clauses as well. For a
clause sequence $C_1, C_2, \ldots, C_k$, we define its resolvent as:


$$\mathit{Res}(C_1, C_2, \ldots, C_k) = \mathit{Res}(C_1, \mathit{Res}(C_2, \cdots, \mathit{Res}(C_{k-1}, C_k) \cdots))$$


The sequence $C_1, C_2, \ldots, C_k$ is termed the antecedent. Again, the operation is
undefined if any individual application of the operation is undefined. For a set of clauses
$\psi$, we define $\mathit{Res}(\psi)$ as the set of all resolvents that can be generated from sequences
composed of clauses from $\psi$ with each clause used at most once per sequence.

As a separate notation, for a set of clauses $\psi$, we let $\mathit{Res}_y(\psi)$ be the set of all defined
resolvents $\mathit{Res}(C, D)$ with $C, D \in \psi$, $y \in C$, and $\bar{y} \in D$.
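
As an illustration, the resolution operation and its extension to chains might be coded as below. The sketch assumes a DIMACS-like encoding (a literal is a nonzero integer, its complement is its negation, a clause is a `frozenset`) and, unlike the definition above, accepts the pivot in either polarity in the first argument; both choices are assumptions made here for brevity.

```python
def resolve(c, d):
    """Return Res(C, D), or None when the resolvent is undefined.

    The resolvent is defined only when exactly one literal of C has its
    complement in D, which also rules out tautological resolvents.
    """
    pivots = [lit for lit in c if -lit in d]
    if len(pivots) != 1:
        return None
    y = pivots[0]
    return (c | d) - {y, -y}


def resolve_chain(clauses):
    """Return Res(C1, Res(C2, ..., Res(C_{k-1}, C_k))), folding from the right."""
    result = clauses[-1]
    for c in reversed(clauses[:-1]):
        if result is None:
            return None
        result = resolve(c, result)
    return result


c1 = frozenset({1, 2})    # x1 OR x2
c2 = frozenset({-2, 3})   # NOT x2 OR x3
c3 = frozenset({-3})      # NOT x3
chain_resolvent = resolve_chain([c1, c2, c3])
empty_clause = resolve(frozenset({1}), frozenset({-1}))
undefined = resolve(frozenset({1, 2}), frozenset({-1, -2}))  # two clashing pairs
```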
Extension: Extended resolution [28] allows the introduction of extension variables to
serve as a shorthand notation for other formulas. Generalizing extended resolution to
quantified formulas requires additional considerations regarding 1) the distinction
between existentially and universally quantified variables, and 2) the position of the
extension variables within the quantification ordering. In particular, as extension variables
are generated, they must be classified as existential and be inserted into intermediate
positions in the ordering [3, 17]. To support this capability, we associate a quantification
level $\lambda(y)$ with each input and extension variable $y$. For input variable $x$, where
$x \in X_j$, we define $\lambda(x) = 2j - 1$. Input variables will therefore have odd values for
$\lambda$. Each extension variable $e$ will be assigned an even value for $\lambda(e)$ according to rules
defined below. For literal $l$, we define $\lambda(l) = \lambda(\mathit{Var}(l))$.

As clauses are added and deleted, and as extension variables are introduced, a
formula will be maintained with an overall form


$$\Phi = Q_1 X_1\, \exists E_1\, Q_2 X_2\, \exists E_2 \cdots Q_m X_m\, \exists E_m\, \psi \tag{1}$$


where $E_1, E_2, \ldots, E_m$ is a partitioning of the set of extension variables. The quantifier
prefix $\mathcal{Q}$ in (1) is therefore an alternation of input and extension variables, with all
extension variables being existentially quantified. We can also view the quantifier prefix
as simply being a set of variables $y$, ordered by the values of $\lambda(y)$, where $y$
is universal when $\lambda(y) = 2j - 1$ with $Q_j = \forall$. Otherwise, $y$ is existential. We use
set notation when referring to the quantifier prefix, recognizing that the partitioning of
variables into quantifier blocks and the associated quantifier types are defined implicitly
by the function $\lambda$.

Two quantifier prefixes $\mathcal{Q}$ and $\mathcal{Q}'$, each with $m$ input variable blocks, are said to be
compatible when $Q_j = Q'_j$ for $1 \leq j \leq m$, and $\lambda(y) = \lambda'(y)$ for all $y \in \mathcal{Q} \cap \mathcal{Q}'$,
where the unprimed and primed symbols correspond to $\mathcal{Q}$ and $\mathcal{Q}'$, respectively.

Extension introduces existential variable $e$ by adding a set of defining clauses $\theta$ to
the matrix and adding $e$ to the quantifier prefix. Consider QBF $\Phi = \mathcal{Q}\,\psi$. Let $e$ be a
fresh variable (i.e., $e \notin \mathcal{Q}$) and let $\theta$ be a set of clauses that are blocked on $e$ [5]. That
is, each clause in $\theta$ must contain either $e$ or $\bar{e}$, and for any clauses $C, D \in \theta$ for which
$e \in C$ and $\bar{e} \in D$, there must be some other literal $l \in C$ such that $\bar{l} \in D$, and therefore
$\mathit{Res}_e(\theta) = \emptyset$. Define $\Phi' = \mathcal{Q}'\,\psi'$ as follows. Variable $e$ is assigned quantification level
$\lambda(e) = \max\{\mathit{Even}(\lambda(y)) \mid y \in \mathit{Var}(\theta), y \neq e\}$, where $\mathit{Var}(\theta)$ is defined to be the
set of all variables occurring in the clauses in $\theta$. Function $\mathit{Even}$ rounds a number up
to the next higher even value, i.e., $\mathit{Even}(a) = 2\lceil a/2 \rceil$. This definition guarantees that
$\lambda(e)$ is even and that every variable $y$ occurring in $\theta$ will have $\lambda(y) \leq \lambda(e)$. Letting
$\mathcal{Q}' = \mathcal{Q} \cup \{e\}$ and $\psi' = \psi \cup \theta$, it can be shown that $\Phi'$ is true if and only if $\Phi$ is true [17].
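
The level assignment can be illustrated with a few lines of arithmetic. The function names below are chosen for this sketch and do not come from any solver.

```python
import math


def even(a):
    """Even(a) = 2 * ceil(a / 2): round up to the nearest even value."""
    return 2 * math.ceil(a / 2)


def input_level(j):
    """Level of an input variable in block X_j: always odd."""
    return 2 * j - 1


def extension_level(levels_of_other_vars):
    """Level of a fresh extension variable e, given the levels of the other
    variables occurring in its defining clauses."""
    return max(even(level) for level in levels_of_other_vars)


# e is defined over an input variable from block X_2 (level 3) and an
# existing extension variable at level 2, so e lands at level 4.
level_e = extension_level([input_level(2), 2])
```

Because input levels are odd and extension levels are even, an extension variable always falls strictly after the innermost input variable it depends on.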
Boolean Functions: The restriction of Boolean function $f$ with respect to variable $x$,
denoted $f|_x$, is defined as the function that results when variable $x$ is assigned value 1.
Similarly, $f|_{\bar{x}}$ is defined as the function that results when $x$ is assigned value 0.

The Shannon expansion relates a Boolean function to its restrictions with respect to
a variable and its complement. For a function $f$ and variable $x$:


$$f = \mathit{ITE}(x, f|_x, f|_{\bar{x}}) = (x \rightarrow f|_x) \wedge (\bar{x} \rightarrow f|_{\bar{x}}) \tag{2}$$


We will find clausal form (2) to be of use in generating satisfaction proofs.

For Boolean function $f$ and variable $x$ we can define the existential and universal
quantifications of $f$ with respect to $x$ as projection operations that eliminate the
dependency on $x$ through either disjunction or conjunction:


$$\exists x\, f = f|_x \vee f|_{\bar{x}} \tag{3}$$
$$\forall x\, f = f|_x \wedge f|_{\bar{x}} \tag{4}$$
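
To make (2), (3), and (4) concrete, the sketch below represents a Boolean function as a Python callable over a dictionary of variable values. This representation is an assumption chosen for readability; it is not how a BDD package stores functions.

```python
from itertools import product


def restrict(f, var, value):
    """f|_x (value=True) or f|_{x-bar} (value=False)."""
    return lambda env: f({**env, var: value})


def exists(f, var):
    """Equation (3): disjunction of the two restrictions."""
    return lambda env: restrict(f, var, True)(env) or restrict(f, var, False)(env)


def forall(f, var):
    """Equation (4): conjunction of the two restrictions."""
    return lambda env: restrict(f, var, True)(env) and restrict(f, var, False)(env)


def f(env):
    """Example function f = ITE(x, y, z)."""
    return (env["x"] and env["y"]) or ((not env["x"]) and env["z"])


envs = [dict(zip("xyz", bits)) for bits in product([False, True], repeat=3)]

# Equation (2): f = ITE(x, f|_x, f|_{x-bar}).
shannon_holds = all(
    f(e) == (restrict(f, "x", True)(e) if e["x"] else restrict(f, "x", False)(e))
    for e in envs
)
# For this f, quantifying x leaves y OR z (existential) or y AND z (universal).
exists_is_or = all(exists(f, "x")(e) == (e["y"] or e["z"]) for e in envs)
forall_is_and = all(forall(f, "x")(e) == (e["y"] and e["z"]) for e in envs)
```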


BDDs: A reduced, ordered binary decision diagram (BDD) provides a canonical form
for representing a set of Boolean functions, and an associated set of algorithms for
constructing them and testing their properties [1,7,8]. A set of functions is represented
as a directed acyclic graph, with each function indicated by a pointer to its root node.
We will therefore use the symbol $u$ to refer at times to 1) a node in the BDD, 2) the
subgraph of the BDD having $u$ as its root, 3) the function represented by this subgraph,
and 4) an extension variable associated with the node.

The ordered BDD representation requires defining a total ordering of the variables.
Unlike other BDD-based QBF solvers [17, 23, 24], PGBDDQ allows this ordering to be
independent of the ordering of variables in the quantifier prefix. The two leaf nodes
are denoted $L_0$ and $L_1$, representing the constant functions 0 and 1, respectively. Each
nonterminal node $u$ has an associated variable and two children indicating branches for
the two possible values of the variable.

BDD packages support multiple operations for constructing and testing the
properties of Boolean functions represented by a BDD. A number of these are based on
the Apply algorithm [6]. Given root nodes $u$ and $v$ representing functions $f$ and $g$,
respectively, and a Boolean operation (e.g., AND), the algorithm generates a root node
$w$ representing the result of applying the operation to those functions (e.g., $f \wedge g$).
It operates by traversing its arguments via a series of recursive calls, using a table to
cache previously computed results. Variants of the Apply algorithm can also perform
restriction and quantification.
QBF Solving with a BDD: With the ability to perform disjunction, conjunction, and
quantification of Boolean functions, there is a straightforward algorithm for solving a
QBF with a BDD. It starts by computing a representation of the formula matrix using
the Apply algorithm with operation $\vee$ for each clause and conjuncting these using the
Apply algorithm with operation $\wedge$. Then, quantifiers are eliminated by working from
the innermost quantifier block $X_m$ and working outward, using either universal or
existential quantifier operations. At the end, the BDD will be reduced to either $L_0$, indicating
that the formula is false, or $L_1$, indicating that the formula is true. This basic algorithm
can be improved by deferring some of the conjunctions and by carefully selecting the
order of quantification within each quantifier block [23, 24].
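
The control flow of this basic algorithm can be sketched without a BDD package. In the illustration below, plain Python callables stand in for BDD nodes, so the sketch takes exponential time and shows only the order of operations (build the matrix, then eliminate quantifiers from the innermost outward). The clause encoding (signed integers) and all names are assumptions made for this example.

```python
def eliminate(f, quantifier, var):
    """Remove var from f by conjunction (forall) or disjunction (exists)
    of the two restrictions."""
    def hi(env):
        return f({**env, var: True})

    def lo(env):
        return f({**env, var: False})

    if quantifier == "forall":
        return lambda env: hi(env) and lo(env)
    return lambda env: hi(env) or lo(env)


def solve_qbf(prefix, clauses):
    """prefix: list of (quantifier, variable) pairs, outermost first.
    clauses: list of lists of signed integers over the prefix variables."""
    def matrix(env):
        return all(any(env[abs(lit)] == (lit > 0) for lit in c) for c in clauses)

    f = matrix
    for quantifier, var in reversed(prefix):  # innermost block first
        f = eliminate(f, quantifier, var)
    return f({})  # no free variables remain


xor_matrix = [[1, 2], [-1, -2]]  # exactly one of x1, x2 is true
forall_exists = solve_qbf([("forall", 1), ("exists", 2)], xor_matrix)
exists_forall = solve_qbf([("exists", 2), ("forall", 1)], xor_matrix)
```

The two calls differ only in the quantifier order: choosing $x_2$ after $x_1$ is known succeeds, whereas committing to $x_2$ first does not.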


3 Logical Foundations 


A clausal proof consists of a sequence of steps starting with the clauses in the input
formula $\Phi_I$. Each step either adds a set of clauses, and possibly an extension variable,
or it removes a set of clauses. These additions and removals define a sequence of QBFs
$\Phi_1, \Phi_2, \ldots, \Phi_t$, with $\Phi_1 = \Phi_I$ and each $\Phi_i$ of the form $\mathcal{Q}_i\,\psi_i$.

For a refutation proof, each step $i$ must preserve truth, i.e., $\Phi_i \rightarrow \Phi_{i+1}$, and it must
end with $\bot \in \psi_t$. This construction serves as a proof that $\Phi_I = \Phi_1 \rightarrow \Phi_2 \rightarrow \cdots \rightarrow
\Phi_t = \bot$, and therefore the input formula is false. A satisfaction proof follows the same
general format, except that it requires each step $i$ to preserve falsehood: $\Phi_{i+1} \rightarrow \Phi_i$, and
it reaches a final result with $\psi_t = \emptyset$. This construction serves as a proof that $\top = \Phi_t \rightarrow
\Phi_{t-1} \rightarrow \cdots \rightarrow \Phi_1 = \Phi_I$, and therefore the input formula is true. A dual proof requires
that each step preserves equivalence: $\Phi_i \leftrightarrow \Phi_{i+1}$, i.e., it is both truth and falsehood
preserving. Only the final step with $\Phi_t \in \{\bot, \top\}$ determines whether it is a refutation
or a satisfaction proof.


3.1 Inference Rules 


Table 1 shows the equivalence-preserving inference rules we use in our proofs. These
are based on redundant clauses: cases where there are two sets of clauses $\psi$ and $\theta$ such
that $\mathcal{Q}\,\psi \leftrightarrow \mathcal{Q}'\,(\psi \cup \theta)$, for compatible prefixes $\mathcal{Q}$ and $\mathcal{Q}'$. Thus, adding clauses $\theta$ to
the matrix $\psi$ defines an equivalence-preserving addition rule, while deleting them from
the matrix $\psi \cup \theta$ defines an equivalence-preserving removal rule.


Table 1. Inference rules where clause set $\theta$ is redundant with respect to the clauses in $\psi$.


Addition              Removal                   Requirements

Resolution addition   Resolution deletion       $\theta \subseteq \mathit{Res}(\psi)$.

Universal reduction   –                         $\theta = \{C\}$. $l$ universal. $\lambda(l') < \lambda(l)$ for all
                                                existential $l' \in C$. $C \cup \{l\} \in \psi$.

Extension             Existential elimination   $y$ existential. $y \notin \mathit{Var}(\psi)$. $y \in \mathit{Var}(C)$ for all
                                                $C \in \theta$. $\mathit{Res}_y(\theta) \subseteq \psi$. $\lambda(y') \leq \lambda(y)$ for all
                                                $y' \in \mathit{Var}(\theta)$.


We have already described resolution in Section 2. Universal reduction (also known
as “forall reduction” [4, 17]) is the standard rule for eliminating universal variables in a
QBF refutation proof [18].

The extension rule forms the basis for adding extension variable $y = e$ and its
defining clauses $\theta$. For this case, the clauses in $\theta$ are blocked with respect to $y$, and
therefore $\mathit{Res}_y(\theta) = \emptyset$. As a deletion rule, the existential elimination rule is used to
remove extension variables and their defining clauses, as well as to remove the
existential input variables. It is a generalization of blocked clause elimination [5] in that the
clauses in $\theta$ need not be blocked, as long as $\psi$ contains all of the resolvents with respect
to variable $y$. The redundancies used by the resolution, extension, and existential
elimination rules are special cases of the quantified resolution asymmetric tautology (QRAT)
property [13, 14].


3.2 Integrating Proof Generation into BDD Operations 


As described in [16, 17, 26] and [9], we use a BDD to represent Boolean functions
defined by applying Boolean operations to the input variables $X$. When creating node
$u$, we introduce an extension variable, also referred to as $u$, with up to four defining
clauses. For node $u$ with variable $x$ and children nodes $u_1$ and $u_0$, these clauses encode
the formula $u \leftrightarrow \mathit{ITE}(x, u_1, u_0)$. As described in Section 2, we will have $\lambda(u) =
\max\{\lambda(x) + 1, \lambda(u_1), \lambda(u_0)\}$.

As in [9], we associate leaf nodes $L_0$ and $L_1$ directly with logical values $\bot$ and $\top$.
When constructing node $u$, if either $u_1$ or $u_0$ is a leaf node, the defining clauses may
be simplified, and some may degenerate to tautologies. By defining $\lambda(\bot) = \lambda(\top) = 0$,
we can still use the above formula to define the value of $\lambda(u)$, such that $\lambda(u_1) \leq \lambda(u)$,
$\lambda(u_0) \leq \lambda(u)$, and $\lambda(x) < \lambda(u)$. This guarantees that the value of $\lambda(u)$ is greater than or
equal to that of any node or variable occurring in the subgraph with root $u$.
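
One way to write out the four defining clauses is shown below, together with a truth-table check that they encode $u \leftrightarrow \mathit{ITE}(x, u_1, u_0)$ exactly. The integer encoding of variables and the function names are assumptions of this sketch, and the simplifications that apply when a child is a leaf node are not handled.

```python
from itertools import product


def defining_clauses(u, x, u1, u0):
    """Clauses for u <-> ITE(x, u1, u0); variables are positive integers."""
    return [
        frozenset({-x, -u, u1}),  # x AND u -> u1
        frozenset({x, -u, u0}),   # NOT x AND u -> u0
        frozenset({-x, -u1, u}),  # x AND u1 -> u
        frozenset({x, -u0, u}),   # NOT x AND u0 -> u
    ]


def node_level(level_x, level_u1, level_u0):
    """lambda(u) = max{lambda(x) + 1, lambda(u1), lambda(u0)}."""
    return max(level_x + 1, level_u1, level_u0)


def satisfies(assignment, clauses):
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in c) for c in clauses)


X, U1, U0, U = 1, 2, 3, 4
clauses = defining_clauses(U, X, U1, U0)
# bits = (x, u1, u0, u): the clauses hold exactly when u equals ITE(x, u1, u0).
encoding_is_exact = all(
    satisfies(dict(zip((X, U1, U0, U), bits)), clauses)
    == (bits[3] == (bits[1] if bits[0] else bits[2]))
    for bits in product([False, True], repeat=4)
)
```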

For node $u$, define its support set $S(u)$ as the set of variables occurring at some node
in the subgraph with root $u$. Based on our construction, any node $u$ will have $\lambda(u) = 2j$
if and only if there is some $j$ and some $x$ for which $x \in X_j \cap S(u)$, and this property
does not hold for any $j' > j$.

As a final notation, let $\theta(u)$ denote the set consisting of the defining clauses for all
nodes in the subgraph with root $u$.

The BDD package implements the set of operations shown in Table 2. Each
generates a result node $w$, and it also generates sets of clauses forming extended
resolution proofs of some properties relating the result to the arguments. As shown, some
of these properties are truth preserving, while others are falsehood preserving. In each
of these, $C$ indicates a clause, $u$, $v$, and $w$ are BDD nodes (or their associated extension
variables), and $l$ is a literal of an input variable.


Table 2. Required BDD Operations. Each generates a root node plus a set of proofs.


                                                       Proved Properties
Operation    Arguments   Result                        Truth Preserving                            Falsehood Preserving

FROMCLAUSE   $C$         $u = \bigvee_{l \in C} l$     $C, \theta(u) \vdash u$                     $u, \theta(u) \vdash C$
APPLYAND     $u, v$      $w = u \wedge v$              $u \wedge v \rightarrow w$                  $w \rightarrow u$, $w \rightarrow v$
APPLYOR      $u, v$      $w = u \vee v$                $u \rightarrow w$, $v \rightarrow w$        $w \rightarrow u \vee v$
RESTRICT     $u, l$      $w = u|_l$                    $l \wedge u \rightarrow w$                  $l \wedge w \rightarrow u$

These operations serve the following roles: 


– FROMCLAUSE generates the BDD representation $u$ of a clause $C$. It also generates
  a set of resolution steps proving that the unit clause $u$ is logically entailed by the
  input clause and defining clauses: $u \in \mathit{Res}(\{C\} \cup \theta(u))$, and the converse:
  $C \in \mathit{Res}(\{u\} \cup \theta(u))$.

– APPLYAND generates the BDD representation $w$ of the conjunction of its
  arguments. It also generates a proof that the extension variables for the argument and
  result nodes satisfy $u \wedge v \rightarrow w$, as well as a proof of the converse: $w \rightarrow u$ and
  $w \rightarrow v$, and therefore $w \rightarrow u \wedge v$.

– APPLYOR generates the BDD representation $w$ of the disjunction of its arguments.
  Its generated proofs include $u \rightarrow w$ and $v \rightarrow w$, implying that $u \vee v \rightarrow w$, as well
  as the converse: $w \rightarrow u \vee v$.

– RESTRICT generates the restriction $w$ of argument $u$ with respect to literal $l$. It
  generates proofs that the operation satisfies downward implication: $l \wedge u \rightarrow w$,
  and also upward implication: $l \wedge w \rightarrow u$. This operation has the property that for
  $x = \mathit{Var}(l)$, variable $x$ will not occur in the subgraph with root $w$, i.e., $x \notin S(w)$.


4 Integrating Proof Generation into a QBF Solver 


PGBDDQ solves a QBF by maintaining a set $T$ of root nodes, which we refer to as
“terms.” Each term is the result of conjuncting and applying elimination operations
to some subset of the input clauses. $T$ initially contains the root nodes for the BDD
representations of the input clauses. The solver repeatedly removes one or two terms
from $T$, performs a quantification or conjunction operation, and adds the result to $T$,
except that terms with value $L_1$ are not added. Quantifiers are eliminated in reverse
order, starting with block $X_m$ and continuing through $X_1$. The process continues until
either some generated term is the leaf value $L_0$, indicating that the formula is false, or
the set becomes empty, indicating that the formula is true. The solver simultaneously
generates proof steps, including ones that add a unit clause $u$ for each node $u \in T$.
Our presentation describes the general requirements for applying conjunction and
elimination operations. These operations can be used to implement the basic method
described in Section 2, as well as more sophisticated strategies that defer conjunctions
until they are required before performing some of the elimination operations [23,24].

Universal quantification commutes with conjunction and so can be applied to the 
terms independently. Applying existential quantification, on the other hand, requires 
performing conjunction operations until the variables to be quantified occur only in a 
single term. 
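
The asymmetry between the two quantifiers can be confirmed by brute force over all Boolean functions of two variables. The sketch below uses explicit truth tables over $(x, y)$; the representation and names are assumptions made for this check only.

```python
from itertools import product

ASSIGNMENTS = list(product([False, True], repeat=2))  # (x, y) pairs
# All 16 Boolean functions of (x, y), each as a truth table.
FUNCTIONS = [dict(zip(ASSIGNMENTS, bits)) for bits in product([False, True], repeat=4)]


def conj(f, g):
    return {a: f[a] and g[a] for a in ASSIGNMENTS}


def forall_x(f):
    return {y: f[(False, y)] and f[(True, y)] for y in (False, True)}


def exists_x(f):
    return {y: f[(False, y)] or f[(True, y)] for y in (False, True)}


def conj_y(f, g):
    return {y: f[y] and g[y] for y in (False, True)}


# forall x (f AND g) equals (forall x f) AND (forall x g) for every pair.
forall_commutes = all(
    forall_x(conj(f, g)) == conj_y(forall_x(f), forall_x(g))
    for f in FUNCTIONS for g in FUNCTIONS
)
# The same identity fails for exists, e.g. with f = x and g = NOT x.
exists_counterexamples = [
    (f, g) for f in FUNCTIONS for g in FUNCTIONS
    if exists_x(conj(f, g)) != conj_y(exists_x(f), exists_x(g))
]
```

This is why terms containing an existential variable must first be conjuncted into a single term, whereas universal variables can be eliminated term by term.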


4.1 Dual Proof Generation 


For both technical and implementation reasons, which we explain below, we require the
input formula to have only a single variable in each quantifier block. This restriction can
be satisfied by rewriting an arbitrary QBF, such that a quantifier block with $k$ variables
is serialized, splitting it into a sequence of $k$ distinct quantification levels.

When generating a dual proof, the solver generates steps proving that each update
to the set of terms $T$ preserves equivalence with the input formula. More formally,
consider a matrix $\psi$ containing the following clauses: 1) unit clause $u$ for each $u \in T$,
plus 2) all of the defining clauses $\theta(u)$ for the subgraph rooted by each node $u \in T$.
Let $\mathcal{Q}$ be the compatible quantifier prefix formed by augmenting input prefix $\mathcal{Q}_I$
with the extension variables associated with the nodes in these subgraphs. Then each
update preserves the invariant that $\mathcal{Q}_I\,\psi_I \leftrightarrow \mathcal{Q}\,\psi$. Furthermore, the solver takes care to
systematically delete clauses once they are no longer needed, using the removal rules
listed in Table 1. That enables it to finish with an empty set of clauses in the event the
formula is true. The initial set of terms $T$ consists of a root node $u$ for each input clause
$C$, and the solver uses the proof that $C, \theta(u) \vdash u$ to justify adding unit clause $u$ to the
proof. It then uses this unit clause, plus the proof that $u, \theta(u) \vdash C$, to justify deleting
input clause $C$.

Each step proceeds by generating new terms and by adding and removing clauses
in the proof. Suppose the step involves computing results with root nodes $w_1, \ldots, w_n$
based on argument terms $u_1, \ldots, u_k$. If any of the result nodes is BDD leaf $L_0$, then
the formula is false. The solver can use truth-preserving rules generated by the BDD
operations to justify adding an empty clause. Otherwise, the solver removes the
argument terms from $T$ and adds the result nodes, except for any equal to BDD leaf $L_1$. The
solver uses the existing unit clauses plus the truth-preserving rules to justify adding unit
clauses for each newly added term. It then uses the falsehood-preserving rules and the
newly added unit clauses to justify deleting the unit clauses associated with the
argument terms. It must also explicitly generate rules to remove some intermediate clauses
that are added during these proof constructions. Other clauses, including the defining
clauses for the BDD nodes and the clauses added during the BDD operations, get
removed by a separate process described in Section 4.2. The net effect for each step then
is to replace the argument terms in $T$ by the non-constant result terms, maintaining a
unit clause for each term in $T$ as part of the proof.
Conjunction operations. For $u, v \in T$, the solver computes $w = \text{APPLYAND}(u, v)$.
For the case where $w = L_0$, the generated truth-preserving proof will be the clause $\bar{u} \vee \bar{v}$,
which resolves with unit clauses $u$ and $v$ to generate the empty clause: the solver has
proved that the formula is false.

Otherwise, the solver sets $T$ to be $T - \{u, v\} \cup \{w\}$. The proof for adding unit clause
$w$ follows by resolving the unit clauses $u$ and $v$ with the generated clause $\bar{u} \vee \bar{v} \vee w$
(i.e., $u \wedge v \rightarrow w$). The generated clauses $w \rightarrow u$ and $w \rightarrow v$ each resolve with unit
clause $w$ to justify deleting unit clauses $u$ and $v$.
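
The resolution steps behind a conjunction operation are short enough to trace by hand. The sketch below does so with an assumed integer encoding ($u = 1$, $v = 2$, $w = 3$, negative numbers for complemented literals); it mirrors the clause manipulations only and does not model the proof file format.

```python
def unit_resolve(clause, unit_literal):
    """Resolve a clause with the unit clause {unit_literal}."""
    if -unit_literal not in clause:
        raise ValueError("clause does not contain the complement of the unit literal")
    return clause - {-unit_literal}


U, V, W = 1, 2, 3

# Truth-preserving direction: (NOT u OR NOT v OR w) with units u and v yields unit w.
and_clause = frozenset({-U, -V, W})
unit_w = unit_resolve(unit_resolve(and_clause, U), V)

# Falsehood-preserving direction: w -> u and w -> v with unit w yield u and v,
# so the old unit clauses are resolvents of the remaining clauses and can be
# removed by resolution deletion.
rederived_u = unit_resolve(frozenset({-W, U}), W)
rederived_v = unit_resolve(frozenset({-W, V}), W)

# Case w = L0: the generated clause is (NOT u OR NOT v), which resolves with
# units u and v to the empty clause.
empty = unit_resolve(unit_resolve(frozenset({-U, -V}), U), V)
```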

Universal elimination operation. This operation is performed when Q; = V, and by 
our restriction, we must have X; = {x} for some universal variable x. We also require 
that the input variables for blocks X,, such that j’ > j have already been eliminated. 

Since universal quantification commutes with conjunction, the solver can quantify 
each term individually and let subsequent conjunction operations perform the conjunc- 
tion indicated in (4). That is, for each u € T such that x € S(u), operation RESTRICT 
is used to compute the two restrictions Wy = ul, and wy = u|z. These will generate 
proofs of two downward implications: l A u — w; for 1 € {a,Z}, as well as proofs of 
two upward implications: l A w, > u. 

If $w_l$ equals leaf node $L_0$ for either $l = x$ or $l = \bar{x}$, then the corresponding downward
implication will be a clause of the form $l \land u \to \bot = \bar{l} \lor \bar{u}$. Resolving this with the
unit clause $u$ and applying universal reduction generates the empty clause: the solver
has proved that the formula is false.

Consider the general case, where neither $w_x$ nor $w_{\bar{x}}$ is a leaf node. The solver sets
$T = T \cup \{w_x, w_{\bar{x}}\} - \{u\}$. The downward implications $l \land u \to w_l$ can be resolved
with unit clause $u$ to yield the clause $l \to w_l$ for $l \in \{x, \bar{x}\}$. We can be certain that
$\lambda(w_l) < \lambda(x)$ for both values of $l$, since $x \notin S(w_l)$. Applying universal reduction to
the two generated clauses then yields the unit clauses $w_x$ and $w_{\bar{x}}$. Resolving each unit
clause $w_l$ with the upward implication $l \land w_l \to u$ gives the clause $l \to u$, for $l \in \{x, \bar{x}\}$.
Resolving these with each other justifies deleting unit clause $u$. Intermediate clauses
$x \to u$, $\bar{x} \to u$, $x \to w_x$, and $\bar{x} \to w_{\bar{x}}$ are removed by resolution deletion.

The case where one of the restrictions is the leaf node $L_1$ is handled similarly to the
general case, except that this node is not added to $T$.
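
The role of the level condition $\lambda(w_l) < \lambda(x)$ in universal reduction can be illustrated with a small sketch. The clause representation, the `level` map, and the variable numbers below are assumptions chosen for the example; the rule itself is the usual one that drops a universal literal when no existential literal in the clause lies at a higher level.

```python
from typing import Dict, FrozenSet, Set

Clause = FrozenSet[int]  # literals as non-zero ints; -n is the negation of n


def universal_reduce(c: Clause, level: Dict[int, int], universal: Set[int]) -> Clause:
    """Drop each universal literal that is not below some existential literal."""
    exist_levels = [level[abs(l)] for l in c if abs(l) not in universal]
    max_exist = max(exist_levels, default=-1)
    return frozenset(
        l for l in c if abs(l) not in universal or level[abs(l)] < max_exist
    )


# Hypothetical numbering: universal input variable x = 1 at level 2j - 1 = 3,
# extension variables 10 and 11 for the restrictions at level 2, and an
# extension variable 12 at level 4 that would still depend on block j.
level = {1: 3, 10: 2, 11: 2, 12: 4}
universal = {1}

reduced_pos = universal_reduce(frozenset({-1, 10}), level, universal)  # x -> w_x
reduced_neg = universal_reduce(frozenset({1, 11}), level, universal)   # not x -> w_notx
blocked = universal_reduce(frozenset({-1, 12}), level, universal)      # level too high
only_universal = universal_reduce(frozenset({-1}), level, universal)   # L0 case
```

The last call mirrors the case where a restriction is $L_0$: once the unit clause $u$ has been resolved away, only a universal literal remains and reduction yields the empty clause.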

Our implementation applies the conjunction operation to terms $w_x$ and $w_{\bar{x}}$ immediately
after they are generated to avoid causing the number of terms to expand by a
factor of $2^k$ when the formula contains a sequence of $k$ universal quantifiers.

Existential elimination operations. This operation is performed when $Q_j = \exists$. We
can assume that $X_j = \{x\}$ for some existential variable $x$. We require that the input
variables for blocks $X_{j'}$ such that $j' > j$ have already been eliminated. We also require
the conjunction operations to have reduced $T$ to contain at most one node $u$ such that
$x \in S(u)$. The solver proceeds as follows to existentially quantify $x$ from $u$, yielding a
new term $w$ and creating the justification for adding unit clause $w$. It also removes unit
clause $u$, as well as some intermediate clauses. Note that $w$ can equal $L_1$ but not $L_0$.


1. Compute $u_x = \mathrm{RESTRICT}(u, x)$ and $u_{\bar{x}} = \mathrm{RESTRICT}(u, \bar{x})$, generating proofs of
the downward implications $x \land u \to u_x$ and $\bar{x} \land u \to u_{\bar{x}}$, as well as the upward
implications $x \land u_x \to u$ and $\bar{x} \land u_{\bar{x}} \to u$. Resolving the two downward implications
with the unit clause $u$ justifies adding clauses $C_x = x \to u_x$ and $C_{\bar{x}} = \bar{x} \to u_{\bar{x}}$.
These clauses form the Shannon expansion (2) of $u$ with respect to variable $x$.

2. For $l \in \{x, \bar{x}\}$, resolving clause $C_l$ with the upward implication $l \land u_l \to u$ justifies
adding clauses $x \to u$ and $\bar{x} \to u$. Resolving these with each other justifies deleting
unit clause $u$. This step completes the replacement of $u$ by its Shannon expansion.

3. Apply clause removal to remove every clause containing a literal $l$ such that $\lambda(l) >
\lambda(x) = 2j - 1$. This is described in Section 4.2.

4. $C_x$ and $C_{\bar{x}}$ are the only clauses remaining that contain either $x$ or $\bar{x}$. Resolving
these with each other justifies adding clause $u_x \lor u_{\bar{x}}$. The existential elimination
rule can now be applied to justify deleting $C_x$ and $C_{\bar{x}}$, with the result that there will
be no further clauses containing any literal $l$ with $\lambda(l) \ge \lambda(x)$.

5. Compute $w = \mathrm{APPLYOR}(u_x, u_{\bar{x}})$, generating three proofs: $u_x \to w$, $u_{\bar{x}} \to w$, and
$w \to u_x \lor u_{\bar{x}}$.

6. If $w$ is leaf node $L_1$, then the falsehood-preserving proof generated by APPLYOR
derives the clause $u_x \lor u_{\bar{x}}$. This proof justifies deleting the instance of this clause
added in step 4. If $w$ is a nonleaf node, then the first two proofs from step 5 can be
resolved with the clause $u_x \lor u_{\bar{x}}$ to justify adding unit clause $w$, and the third can
be resolved with this unit clause to justify deleting clause $u_x \lor u_{\bar{x}}$. This completes
the replacement of $u$ by the disjunction of its two restrictions, as in (3).

7. If $w$ is leaf node $L_1$, then set $T$ to $T - \{u\}$. Otherwise, set it to $T - \{u\} \cup \{w\}$.
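
The semantic content of steps 1, 2, and 5 can be checked by brute force on a small example. The sketch below represents terms as Python predicates rather than BDD nodes, and the example term `u` is hypothetical; it only illustrates which implications the proof steps rely on, not how PGBDDQ generates them.

```python
from itertools import product
from typing import Callable, Dict, Iterator, Sequence

Assignment = Dict[str, bool]
BoolFn = Callable[[Assignment], bool]


def restrict(f: BoolFn, var: str, value: bool) -> BoolFn:
    """Cofactor of f obtained by fixing `var` to `value`."""
    return lambda a: f({**a, var: value})


def apply_or(f: BoolFn, g: BoolFn) -> BoolFn:
    return lambda a: f(a) or g(a)


def all_assignments(names: Sequence[str]) -> Iterator[Assignment]:
    for bits in product([False, True], repeat=len(names)):
        yield dict(zip(names, bits))


def implies_everywhere(f: BoolFn, g: BoolFn, names: Sequence[str]) -> bool:
    """Brute-force check that f -> g holds under every assignment."""
    return all((not f(a)) or g(a) for a in all_assignments(names))


# Hypothetical term u(x, y, z) = (x OR y) AND (NOT x OR z).
names = ["x", "y", "z"]
u: BoolFn = lambda a: (a["x"] or a["y"]) and ((not a["x"]) or a["z"])

u_x = restrict(u, "x", True)      # step 1: restriction to x = 1 (here: z)
u_nx = restrict(u, "x", False)    # step 1: restriction to x = 0 (here: y)
w = apply_or(u_x, u_nx)           # step 5: w = u_x OR u_nx (here: y OR z)

# Downward and upward implications for the positive phase of x.
down_ok = implies_everywhere(lambda a: a["x"] and u(a), u_x, names)
up_ok = implies_everywhere(lambda a: a["x"] and u_x(a), u, names)
```

For this term the result `w` no longer depends on `x`, which is the property that lets step 4 remove every clause mentioning $x$ or $\bar{x}$.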


Overall Operation: For a false formula, the solver will terminate with the generation
of leaf value $L_0$ during a conjunction or universal quantification operation. These cases
will cause the proof to terminate with the addition of an empty clause. For a true formula,
the solver will finish with $T$ equal to the empty set, since it never adds a leaf node
to $T$. A final clause removal operation with quantification level 0 then yields an empty set of clauses.

We can see now why we impose the restriction that any quantifier block $X_j$ with
$Q_j = \forall$ contain only one variable. Without it, the universal variable elimination operation
may not be possible. Suppose $X_j = \{x, x'\}$. Attempting to perform the universal
quantification operation on variable $x$ could yield a BDD node $w_l$, with either $l = x$ or
$l = \bar{x}$, that depends on $x'$. That would require that $\lambda(w_l) > \lambda(x') = \lambda(x)$, and so the
universal reduction rule could not be applied. Serializing the universal blocks avoids
this difficulty, without limiting the generality of the solver.


4.2 Clause Removal 


As a dual proof proceeds, the BDD operations cause clauses to be added as extension 
variables are introduced and as inferences are made via resolution. Other clauses are 
added and removed explicitly by the proof steps, including the unit clauses for each 
term and the intermediate clauses generated by the steps. In order to support having the 
outcome of the solver be true, the defining and resolution clauses must be removed in 
order to ultimately end up with an empty set of clauses. The solver must justify their 
removal, since clause deletion is not, in general, equivalence preserving. 

Clause removal is triggered when performing existential quantification, just before
applying the variable elimination rule with variable $x$ to remove clauses $C_x$ and $C_{\bar{x}}$
(step 3). We must first ensure that there are no other clauses containing $x$ or $\bar{x}$.

Our method is to remove any clause $C$ containing a literal $l$ for which $\lambda(l) >
\lambda(x) = 2j - 1$. Clause removal can proceed by stepping through the clauses in the
reverse order from how they were added. If a clause that was added by resolution contains
a literal $l$ with $\lambda(l) \ge 2j$, it can be removed via resolution deletion, using the same
antecedent as was used when it was added.

Suppose the solver encounters the defining clauses for a node $u$ with $\lambda(u) \ge 2j$.
It can be certain that all clauses added by resolution that contain either $u$ or $\bar{u}$ have
already been removed, since these must have followed the introduction of $u$ in the clause
ordering. Similarly, any parent node $v$ of $u$ must have already had its defining clauses
removed, since the defining clauses for $v$ must occur after those for $u$. The existential
elimination rule can therefore be used to remove the defining clauses for $u$.

Working through the set of clauses in reverse order, the solver may encounter clauses
added by resolution and defining clauses containing only literals $l$ with $\lambda(l) \le 2j - 1$.
These need not be removed, and indeed they can prove useful (clauses added by resolution)
or necessary (some defining clauses) for subsequent proof steps. They will be
deleted by clause removal during later phases.
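
The reverse-order scan can be sketched as follows. The `ProofClause` record, the `kind` labels, and the variable numbering are assumptions introduced for this illustration; the sketch only decides which clauses go and which rule would justify each deletion, without producing actual proof steps.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class ProofClause:
    lits: FrozenSet[int]   # literals as non-zero ints; -n is the negation of n
    kind: str              # "resolution" or "defining"


def clause_removal(
    clauses: List[ProofClause], level: Dict[int, int], threshold: int
) -> Tuple[List[ProofClause], List[Tuple[ProofClause, str]]]:
    """Scan clauses newest first; delete those with a literal above `threshold`."""
    kept_reversed: List[ProofClause] = []
    deletions: List[Tuple[ProofClause, str]] = []
    for c in reversed(clauses):
        if any(level[abs(l)] > threshold for l in c.lits):
            rule = ("resolution deletion" if c.kind == "resolution"
                    else "existential elimination")
            deletions.append((c, rule))
        else:
            kept_reversed.append(c)
    return list(reversed(kept_reversed)), deletions


# Hypothetical numbering: input variable x = 1 at level 3 (so j = 2), an outer
# input variable 5 at level 1, and BDD nodes 10 (level 4) and 11 (level 2).
level = {1: 3, 5: 1, 10: 4, 11: 2}
history = [
    ProofClause(frozenset({-11, 5}), "defining"),
    ProofClause(frozenset({-10, 1, 11}), "defining"),
    ProofClause(frozenset({10, 11}), "resolution"),
    ProofClause(frozenset({-1, 11}), "resolution"),   # plays the role of C_x
]
kept, deletions = clause_removal(history, level, threshold=3)
```

Because the scan runs newest first, the resolution clause mentioning node 10 is deleted before the defining clause of node 10, matching the ordering argument in the text.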

We can see now why we impose the restriction that any quantifier block $X_j$ with
$Q_j = \exists$ contain only one variable. It enables the use of the $\lambda$ values to determine
which clauses should be removed to eliminate any dependency on existential variable $x$.
Serializing the existential quantifier blocks allows this scheme to work without limiting
the generality of the solver.


4.3 Specializing to Refutation or Satisfaction Proofs 


Dual proofs have the advantage that they can be generated as a single pass, without 
knowing in advance whether the formula is true or false. On the other hand, they are, 
by necessity, somewhat longer and require more time to generate and to check. Another 
approach is to know (or guess) what the outcome will be and then direct the solver to 
generate a pure refutation or satisfaction proof. Specializing the proof generation to one 
of these forms is straightforward, and it can take advantage of more efficient ways to 
perform some of the quantifications. 

A refutation proof need only justify that each step preserves truth. This enables sev- 
eral optimizations. Observe that deleting a clause always preserves truth, because it can 
only cause the set of satisfying solutions for the matrix to expand. Therefore clause 
deletion can be performed without any justification and instead be incorporated into 
the BDD garbage collection process [9]. Second, the BDD package need not gener- 
ate the falsehood-preserving proofs shown in Table 2, reducing the number of clauses 
generated. Finally, the existential operation of (3) is inherently truth preserving. BDD 
packages can implement the quantification of a function by an entire set of variables via 
a variant of the Apply algorithm. If the quantification of root node u generates result 
node w, then the solver can run an implication test after the BDD computation has been 
performed to prove that $u \to w$, as is done with our SAT solver [9]. This avoids the
need to serialize existential quantifier blocks and to have the solver generate low-level 
proof steps for each existential variable. 

Conversely, a satisfaction proof need only justify that each step preserves falsehood. 
Adding a clause always preserves falsehood, since it can only reduce the set of satis- 
fying solutions for the matrix, and therefore clause addition can be performed without 
any justification. In addition, the BDD package need not generate the truth-preserving 
proofs shown in Table 2. Finally, universal quantification can be performed on an en- 
tire block of variables producing node w from argument u. The solver can then run an 
implication test to generate a proof that $w \to u$.
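
The two monotonicity observations used above (deleting a clause can only enlarge the set of satisfying assignments of the matrix, adding one can only shrink it) are easy to confirm by enumeration. The following sketch uses a hypothetical three-variable matrix and a brute-force model counter written for this example.

```python
from itertools import product
from typing import List, Set, Tuple


def models(clauses: List[Set[int]], num_vars: int) -> Set[Tuple[bool, ...]]:
    """All satisfying assignments of a CNF over variables 1..num_vars."""
    result = set()
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            result.add(bits)
    return result


# Hypothetical matrix: (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3).
matrix = [{1, 2}, {-1, 3}, {-2, -3}]
after_deletion = matrix[:-1]          # drop the last clause
after_addition = matrix + [{1}]       # add the unit clause x1

base = models(matrix, 3)
larger = models(after_deletion, 3)
smaller = models(after_addition, 3)
```

Here `base` is a subset of `larger` and a superset of `smaller`, which is why a refutation proof may delete clauses freely and a satisfaction proof may add them freely.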


5 Experimental Results


PGBDDQ¹ is written entirely in Python and consists of around 3350 lines of code, including
a BDD package, support for generating extended-resolution proofs, and the
overall QBF solver. By comparison, our proof-generating BDD-based SAT solver re- 
quired around 2130 lines of code [9]. PGBDDQ can generate proofs in either the QRAT 
format [13, 14] or in a format we call QPROOF that supports just the proof rules given 
in Table 1. The latter format requires explicit lists of antecedents, and therefore each 
step can be checked without any search. 

The overall control of PGBDDQ is based on a form of bucket elimination [10], where
each quantifier block $X_j$ defines a bucket. It starts by generating BDD representations
of the input clauses. The resulting terms are inserted into buckets according to the value
of $\lambda(u)$ for each root node $u$. As described in Section 3.2, this value will be $2j$ when
$u$ contains a variable from block $X_j$ in its support, and it has no variables at higher
quantification levels.

Processing proceeds from the highest numbered bucket downward. For a universal
level, quantification is performed for each bucket element individually with the results
placed into buckets according to their values for $\lambda$. For an existential level, the elements
are conjoined and then existential quantification is performed. The result is placed into
a bucket according to its value of $\lambda$.

We can see that this approach defers conjunction as long as possible, only operating
on terms at some quantification level $j$ that truly depend on one or more variables in $X_j$.
Similar techniques have been used in other BDD-based QBF solvers [23,24]. However,
other implementations place terms into buckets according to the BDD level of their root
nodes, requiring the BDD variables to be ordered as the inverse of the quantification
ordering. By labeling each node with its value of $\lambda$, we can determine the appropriate
bucket from the root node without regard to the BDD variable ordering.
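
The control loop described above can be sketched without BDDs by representing each term as a pair of a support set and a Python predicate. Everything in this sketch (the `Term` type, `solve`, the single-variable prefix format) is an assumption made for illustration; in particular, the support is tracked syntactically rather than read off a BDD, and no proof is produced.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Assignment = Dict[str, bool]
Term = Tuple[FrozenSet[str], Callable[[Assignment], bool]]  # (support, function)


def conj(t1: Term, t2: Term) -> Term:
    (s1, f1), (s2, f2) = t1, t2
    return (s1 | s2, lambda a: f1(a) and f2(a))


def quantify(t: Term, var: str, universal: bool) -> Term:
    s, f = t
    lo = lambda a: f({**a, var: False})
    hi = lambda a: f({**a, var: True})
    g = (lambda a: lo(a) and hi(a)) if universal else (lambda a: lo(a) or hi(a))
    return (s - {var}, g)


def solve(prefix: List[Tuple[str, str]], terms: List[Term]) -> bool:
    """prefix lists (quantifier, variable) pairs, outermost first; 'A' or 'E'."""
    level = {var: j for j, (_, var) in enumerate(prefix, start=1)}
    buckets: Dict[int, List[Term]] = {j: [] for j in range(len(prefix) + 1)}

    def place(t: Term) -> None:
        # Bucket index: highest quantification level in the support (0 if none).
        buckets[max((level[v] for v in t[0]), default=0)].append(t)

    for t in terms:
        place(t)
    for j in range(len(prefix), 0, -1):
        quant, var = prefix[j - 1]
        bucket, buckets[j] = buckets[j], []
        if not bucket:
            continue
        if quant == "A":
            # Universal level: quantify each term individually.
            for t in bucket:
                place(quantify(t, var, universal=True))
        else:
            # Existential level: conjoin the bucket, then quantify.
            combined = bucket[0]
            for t in bucket[1:]:
                combined = conj(combined, t)
            place(quantify(combined, var, universal=False))
    # Bucket 0 holds constant terms; the formula is true iff all are true.
    return all(f({}) for _, f in buckets[0])


# Matrix (x OR y) AND (NOT x OR NOT y) under two different prefixes.
both = frozenset({"x", "y"})
terms: List[Term] = [
    (both, lambda a: a["x"] or a["y"]),
    (both, lambda a: (not a["x"]) or (not a["y"])),
]
forall_exists = solve([("A", "x"), ("E", "y")], terms)
exists_forall = solve([("E", "y"), ("A", "x")], terms)
```

Placing each result by the highest level remaining in its support is what lets conjunction be deferred until an existential level forces it.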

We have tested PGBDDQ on a number of scalable benchmark problems, finding it 
performs well in some cases, scaling polynomially, and poorly in others, scaling expo- 
nentially. Here we present results for a problem based on a two-player game. It provides 
insights into how polynomial scaling can be achieved, as well as the performance of the 
solver and two checkers. 

Two-player games provide a rich set of benchmarks for QBF solvers, with each turn 
being translated into a quantification level. To encode the game from the perspective 
of the first player (Player A), A’s turns are encoded with existential quantifiers, while 
the second player’s (Player B) turns are encoded with universal quantifiers. The formula 
will be true if the game has a guaranteed winning strategy for A. The encoding of a game 
into QBF constrains the two players to only make legal moves. It also expresses the 
conditions under which A is the winner, namely that the game consist of t consecutive 
moves, for an odd value of t. Conversely, we can encode the formula where B has a 
winning strategy by reversing the quantifiers and expressing that the game must consist 
of an even number of consecutive moves. For a game where no draws are possible, these 
two formulas will be complementary. 


¹ A demonstration version, complete with solver, checker, and benchmarks, is available at
https://github.com/rebryant/pgbddq-artifact.

Consider a game played on a $1 \times N$ grid of squares with a set of dominos, each of
which can cover two squares. Players alternate turns, each placing a domino to cover
two adjacent squares. The game completes when no more moves are possible, taking at
most $\lfloor N/2 \rfloor$ turns. The first player who cannot place a domino loses. This linear domino
placement game is isomorphic to the object-removal game “Dawson’s Kayles” [2]. It can
be shown that player B has a winning strategy for $N \in \{0, 1, 15, 35\}$ as well as for all
values of the form $34i + c$ where $i \ge 0$ and $c \in \{5, 9, 21, 25, 29\}$ [27].
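
The stated characterization of the second player's wins can be cross-checked for small boards with a standard Sprague-Grundy computation: placing a domino splits a strip of $n$ squares into two independent strips of $a$ and $n - 2 - a$ squares, and the player to move loses exactly when the Grundy value is 0. The function names below are chosen for this sketch.

```python
from typing import List


def grundy_values(max_n: int) -> List[int]:
    """Grundy value of a 1 x n strip for n = 0..max_n."""
    g = [0] * (max_n + 1)
    for n in range(2, max_n + 1):
        reachable = {g[a] ^ g[n - 2 - a] for a in range(n - 1)}
        value = 0
        while value in reachable:   # minimum excludant
            value += 1
        g[n] = value
    return g


def second_player_wins(max_n: int) -> List[int]:
    """Board sizes up to max_n on which the first player (A) loses."""
    return [n for n, value in enumerate(grundy_values(max_n)) if value == 0]


def closed_form(max_n: int) -> List[int]:
    """Sizes given in the text: {0, 1, 15, 35} and 34i + c for the listed c."""
    wins = {n for n in (0, 1, 15, 35) if n <= max_n}
    for c in (5, 9, 21, 25, 29):
        wins.update(range(c, max_n + 1, 34))
    return sorted(wins)
```

Comparing `second_player_wins(100)` with `closed_form(100)` checks the characterization for all boards of up to 100 squares.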


The game is encoded as a QBF by introducing a set of $N - 1$ input variables for each
possible move, each corresponding to the boundary between a pair of adjacent squares.
A set of $N - 1$ Tseitin variables encodes the board state after each move, and sets of
clauses enforce the conditions that 1) each move should cover exactly one boundary,
and 2) neither that boundary nor the two adjacent ones should have been covered previously.
In all, there are around $N^2/4$ universal input variables, $N^2/4$ existential input
variables, and $N^2/2$ Tseitin variables. The number of clauses grows as $O(N^3)$ due
to the quadratic number of clauses to enforce the exactly-one constraints on the input
variables for each move.
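
The variable counts follow from simple arithmetic, sketched below. The sketch assumes $\lfloor N/2 \rfloor$ moves with Player A moving first, and a pairwise exactly-one encoding (one at-least-one clause plus one at-most-one clause per pair of boundaries); the pairwise encoding is an assumption consistent with the "quadratic number of clauses" remark, not a description of the actual benchmark generator.

```python
from typing import Dict


def encoding_sizes(n: int) -> Dict[str, int]:
    moves = n // 2                 # at most floor(N/2) turns
    per_move = n - 1               # one variable per boundary
    a_moves = (moves + 1) // 2     # Player A takes turns 1, 3, 5, ...
    b_moves = moves // 2
    return {
        "existential_inputs": a_moves * per_move,
        "universal_inputs": b_moves * per_move,
        "tseitin": moves * per_move,
        # Assumed pairwise exactly-one encoding, repeated for every move.
        "exactly_one_clauses": moves * (1 + per_move * (per_move - 1) // 2),
    }


sizes_40 = encoding_sizes(40)
```

For $N = 40$ this gives 390 input variables of each kind and 780 Tseitin variables, close to $N^2/4 = 400$ and $N^2/2 = 800$, while the exactly-one clauses alone contribute a term cubic in $N$.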


To achieve polynomial performance, we found that several problem-specific tech- 
niques are required. First, the Tseitin variables for a given move are placed in an exis- 
tential quantifier block immediately following the block for the input variables for the 
move. This is logically equivalent to the usual convention of placing all Tseitin vari- 
ables in an innermost quantifier block, but it enables the bucket elimination algorithm 
to process the clauses for each move in sequence, rather than expanding the formulas in 
terms of only the input variables at the outset. Second, all variables are ordered for the 
BDD in “boundary-major” ordering. That is, all variables, including input and Tseitin 
variables, for the first boundary on the board are included from the first quantification 
level to the last. The variables for the second boundary follow similarly, and so on for 
all $N - 1$ boundaries. This ordering has the effect that, when processing the clauses for
some move, the variables encoding the next and previous states for a boundary, as well
as the proposed change to its state, are localized within the ordering. Finally, when splitting
a quantifier block into a series of single-variable blocks, we ordered them according
to their BDD variable ordering. Since the solver eliminates variables in the reverse of 
their quantifier ordering, this convention causes the disjunction and conjunction opera- 
tions of Equations (3) and (4) to be performed mainly on subgraphs of the BDD below 
the variables being quantified. This enables greater use of previously computed results 
via the operation cache. 


Table 3 shows the performance of PGBDDQ, two checkers, and two other QBF 
solvers on the domino placement game as functions of N. It shows first cases where 
the encoded player has a winning strategy, and therefore the formula is true, and then 
cases where the encoded player’s opponent has a winning strategy, and therefore the 
formula is false. Dual proofs were generated for both cases. For measurements with 
sufficient data points, we show the scaling trends, obtained by performing a linear re- 
gression on the logarithms of data generated for each value of N in increments of 5. 
All measurements were performed on a 4.2 GHz Intel Core i7 (i7-7700K) processor
with 32 GB of memory running the macOS operating system. Times are measured in
elapsed seconds. 

Table 3. Experimental Results for Dual Proof Generation with Linear Domino Placement Game.
The first data series are for proofs of true formulas, and the second are for false formulas. Entries
shown as “-” indicate cases where the program exceeded a 7200-second time limit.

                                    PGBDDQ                                       Other solvers
N     Winner/Player  Input Clauses  Total Clauses    Solve   Qproof  QRAT-TRIM   DEPQBF   GHOSTQ

10    A/A                      666        132,138      3.1      3.3        3.4      0.1      0.0
15    B/B                    1,725        628,392     15.2     15.7       43.8      3.8      1.3
20    A/A                    3,880      2,572,139     67.3     65.3      605.0   1896.6     57.9
25    B/B                    6,637      7,098,146    202.6    199.5     4265.6        -        -
40    A/A                   24,010     83,736,352   3358.6   3479.5
Trend                        N^2.7          N^4.5    N^4.8    N^4.8

10    A/B                      664        132,403      3.1      3.2        7.3      0.1      0.0
15    B/A                    1,728        629,530     15.2     15.5      108.7      3.6      1.0
20    A/B                    3,885      2,580,284     67.2     66.7     1521.5        -     49.1
25    B/A                    6,631      7,083,515    205.1    190.0          -        -   6942.2
40    A/B                   24,000     83,662,168   3279.2   3457.4
Trend                        N^2.7          N^4.5    N^4.8    N^4.8


As indicated in the column labeled “Input Clauses,” the number of clauses grows
as $N^{2.7}$, not quite reaching the asymptotic value of $N^3$. The numbers of proof clauses
generated by PGBDDQ are nearly the same for both true and false formulas, with growth
rates of $N^{4.5}$. The times taken by the solver (labeled “Solve”) and by our own checker
(“Qproof”) scale at about the same rate as the number of proof clauses.
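
A scaling trend of this kind is the slope of a least-squares line fitted to the logarithms of the measurements. The sketch below applies that fit to four of the total-clause counts listed in Table 3 for true formulas; because the reported trends use every value of $N$ in increments of 5, the exponent obtained from this subset will differ somewhat from the tabulated one.

```python
import math
from typing import Sequence


def scaling_exponent(ns: Sequence[float], values: Sequence[float]) -> float:
    """Slope of the least-squares fit of log(value) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Total proof clauses for true formulas at N = 10, 15, 20, and 40 (Table 3).
ns = [10, 15, 20, 40]
total_clauses = [132_138, 628_392, 2_572_139, 83_736_352]
exponent = scaling_exponent(ns, total_clauses)
```

On data that follows an exact power law, for example values proportional to $n^3$, the function returns the exponent itself.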

We also benchmarked the QBF proof checker QRAT-TRIM [13, 14]. This program 
was already equipped to handle our forms of refutation and satisfaction proofs, and it 
can handle dual proofs without modification. The only concession to the idiosyncrasies 
of PGBDDQ was to serialize the universal quantifier blocks in the prefix of false formu- 
las. This is required to enable application of the universal reduction rule. The existential 
blocks can stay intact, since our only reason to serialize these is to guide the clause re- 
moval process. Although the scaling of QRAT-TRIM is poor, it is encouraging that the 
solver can be verified by a checker that predates it by a number of years. 

For comparison, we evaluated the performance of two other QBF solvers on this 
benchmark: DEPQBF, version 6.0 [20], and GHOSTQ [15,19]. We found they are both 
very fast for smaller values of N but then reach a narrow range of values for which 
they transition from running in just a few seconds to exceeding the timeout limit of 
7200 seconds. For DEPQBF, this transition occurs as N ranges from 17 to 21, and for 
GHOSTQ, as N ranges from 21 to 26. PGBDDQ is much slower for small values of N,
but it keeps scaling without hitting a sudden cutoff. 

Although we did not run EBDDRES [17], we can use PGBDDQ to evaluate the im- 
pact of having the BDD variable ordering be the inverse of the quantifier ordering. Our 
experiments show that this ordering causes the runtime and proof sizes to scale expo- 
nentially in N. With N = 14 and B as the player, PGBDDQ runs for 4100 seconds to 
generate a refutation proof with 114,157,025 clauses. By contrast, a boundary-major 
ordering requires just 6 seconds and generates a proof with 309,387 clauses. 

Table 4. Experimental Results for Specialized Proof Generation with Linear Domino Placement
Game. The first data series are for satisfaction proofs, and the second are for refutation proofs.

N     Winner/Player  Input Clauses  Total Clauses    Solve   Qproof

10    A/A                      666         90,924      1.8      1.4
20    A/A                    3,880      1,516,756     36.4     24.0
30    A/A                   11,166     10,466,168    346.0    192.6
40    A/A                   24,010     44,874,662   1990.3   1254.8
45    A/A                    32,24     74,891,554   4033.4   2760.8
Trend                        N^2.7          N^4.4    N^4.8    N^4.7

10    A/B                      664        126,127      2.4      1.8
20    A/B                    3,885      1,232,252     27.3     18.6
30    A/B                   11,159      7,084,367    180.0    121.6
40    A/B                   24,010     26,150,238   7713.9    565.0
50    A/B                   43,904     85,077,630   2955.4   2151.4
Trend                        N^2.7          N^4.0    N^4.3    N^4.3


Table 4 shows the advantage of generating specialized proofs when the formula is 
known in advance to be true or false. Comparing the columns labeled “Total Clauses” in 
Tables 3 and 4, we can see especially that refutation proofs are asymptotically shorter. 
These can take advantage of the more efficient approach to existential quantification in 
handling the large number of Tseitin variables. Again, the solution and checking time 
track the proof sizes. These optimizations allowed us to solve larger instances of the 
problem—up to N = 45 for true instances and N = 50 for false ones. 


6 Conclusions 


We have demonstrated that a QBF solver can emit a single proof as it operates, leading 
to either an empty clause for a false formula or an empty set of clauses for a true one. 
Both the proof and the time required to check it scale as the number of BDD operations 
performed. Moreover, a BDD-based QBF solver can allow the choice of BDD variable 
ordering to be made independently from the quantifier ordering. This feature can be 
critical to obtaining performance that scales polynomially with the problem size. 

Our prototype is only a start in implementing a fully automated QBF solver. Such 
a solver must be able to choose a BDD variable ordering based on the input formula 
structure. It must also be able to identify and move Tseitin variables to earlier positions 
in the quantifier ordering, generating proof steps justifying that this transformation is 
equivalence preserving. 

The underlying operation of PGBDDQ has potential applications beyond QBF solv- 
ing. The program could stop the process described in Section 4.1 at any point and gen- 
erate a QBF that is provably equivalent to the input formula. PGBDDQ could therefore 
be used as a preprocessor for other solvers, and for other applications that require rea- 
soning about Boolean formulas with quantifiers. 


Acknowledgements. The second author is supported by NSF grant CCF-2010951. 


Abstract. We present a fast and reliable reconstruction of proofs gener-
ated by the SMT solver veriT in Isabelle. The fine-grained proof format 
makes the reconstruction simple and efficient. For typical proof steps, 
such as arithmetic reasoning and skolemization, our reconstruction can 
avoid expensive search. By skipping proof steps that are irrelevant for 
Isabelle, the performance of proof checking is improved. Our method 
increases the success rate of Sledgehammer by halving the failure rate 
and reduces the checking time by 13%. We provide a detailed evaluation 
of the reconstruction time for each rule. The runtime is influenced by 
both simple rules that appear very often and common complex rules. 


Keywords: automatic theorem provers · proof assistants ·
proof verification


1 Introduction 


Proof assistants are used in verification and formal mathematics to provide 
trustworthy, machine-checkable formal proofs of theorems. Proof automation 
reduces the burden of finding proofs and allows proof assistant users to focus on 
the core of their arguments instead of technical details. A successful approach
implemented by “hammers,” like Sledgehammer for Isabelle [15], is to heuristically
select facts from the background; to use an external automatic theorem prover,
such as a satisfiability modulo theories (SMT) solver [12], to filter facts needed
to discharge the goal; and to use the filtered facts to find a trusted proof.

Isabelle does not accept proofs that do not go through the assistant’s inference 
kernel. Hence, Sledgehammer attempts to find the fastest internal method that 
can recreate the proof (preplay). This is often a call of the smt tactic, which runs 
an SMT solver, parses the proof, and reconstructs it through the kernel. This 
reconstruction allows the usage of external provers. The smt tactic was originally 
developed for the SMT solver Z3 [18,34]. 


© The Author(s) 2021 

The SMT solver CVC4 [10] is one of the best solvers on Sledgehammer-generated
problems [14], but currently does not produce proofs for problems with
quantifiers. To reconstruct its proofs, Sledgehammer mostly uses the smt tactic 
based on Z3. However, since CVC4 uses more elaborate quantifier instantiation 
techniques, many problems provable for CVC4 are unprovable for Z3. Therefore, 
Sledgehammer regularly fails to find a trusted proof and the user has to write the 
proofs manually. veriT [19] (Sect. 2) supports these techniques and we extend 
the smt tactic to reconstruct its proofs. With the new reconstruction (Sect. 3), 
more smt calls are successful. Hence, less manual labor is required from users. 

The runtime of the smt method depends on the runtime of the reconstruction 
and the solver. To simplify the reconstruction, we do not treat veriT as a black 
box anymore, but extend it to produce more detailed proofs that are easier 
to reconstruct. We use detailed rules for simplifications with a combination of 
propositional, arithmetic, and quantifier reasoning. Similarly, we add additional 
information to avoid search, e.g., for linear arithmetic and for term normalization. 
Our reconstruction method uses the newly provided information, but it also has 
a step skipping mode that combines some steps (Sect. 4). 

A very early prototype of the extension was used to validate the fine-grained 
proof format itself [7, Sect. 6.2, second paragraph]. We also published some details 
of the reconstruction method and the rules [25] before adapting veriT to ease 
reconstruction. Here, we focus on the new features. 

We optimize the performance further by tuning the search performed by veriT. 
Multiple options influence the execution time of an SMT solver. To fine-tune 
veriT’s search procedure, we select four different combinations of options, or 
strategies, by generating typical problems and selecting options with complemen- 
tary performance on these problems. We extend Sledgehammer to compare these 
four selected strategies and suggest the fastest to the user. We then evaluate the 
reconstruction with Sledgehammer on a large benchmark set. Our new tactic 
halves the failure rate. We also study the time required to reconstruct each rule. 
Many simple rules occur often, showing the importance of step skipping (Sect. 5). 

Finally, we discuss related work (Sect. 6). Compared to the prototype [25], 
the smt tactic is now thoroughly tested. We fixed all issues revealed during 
development and improved the performance of the reconstruction method. The 
work presented here is integrated into Isabelle version 2021; i.e., since this version 
Sledgehammer can also suggest veriT, without user interaction. To simplify future 
reconstruction efforts, we document the proof format and all rules used by veriT. 
The resulting reference manual is part of the veriT documentation [40]. 


2 veriT and Proofs 


The SMT solver veriT is an open source solver based on the $\mathrm{CDCL}(\mathcal{T})$ calculus.
In proof-production mode, it supports the theories of uninterpreted functions 
with equality, linear real and integer arithmetic, and quantifiers. To support 
quantifiers veriT uses quantifier instantiation and extensive preprocessing. 

veriT’s proof syntax is an extension of SMT-LIB [11] which uses S-expressions
and prefix notation. The proofs are refutation proofs, i.e., proofs of $\bot$. A proof
is an indexed list of steps. Each step has a conclusion clause (cl ..) and is 
annotated with a rule, a list of premises, and some rule-dependent arguments. 
veriT distinguishes 90 rules [40]. Subproofs are the key feature of the proof format. 
They introduce an additional context. Contexts are used to reason about binders, 
e.g., preprocessing steps like transformation under quantifiers. 

The conclusions of rules with contexts are always equalities. The context 
models a substitution into the free variables of the term on the left-hand side 
of the equality. Consider the following proof fragment that renames the variable 
name x to vr, as done during preprocessing: 


(assume a0 (exists ((x A)) (f x)))
(anchor :step t3 :args (:= x vr))
(step t1 (cl (= x vr)) :rule refl)
(step t2 (cl (= (f x) (f vr))) :rule cong :premises (t1))
(step t3 (cl (= (exists ((x A)) (f x))
                (exists ((vr A)) (f vr)))) :rule bind)


The assume command repeats input assertions or states local assumptions. In 
this fragment the assumption a0 is not used. Subproofs start with the anchor 
command that introduces a context. Semantically, the context is a shorthand for 
a lambda abstraction of the free variable and an application of the substituted 
term. Here the context is $x \mapsto vr$ and the step t1 means $(\lambda x.\ x)\ vr = vr$. The
step is proven by reflexivity (rule refl). Then congruence is applied (step
t2) to prove that $(\lambda x.\ f\ x)\ vr = f\ vr$, and step t3 concludes the renaming.
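
Since the proof syntax is built on S-expressions, a fragment like the one above can be read into nested lists with a few lines of code. The parser below is a minimal sketch written for this example, not the parser used by Isabelle; the concrete fragment text, including the doubled parentheses around sorted variables as in SMT-LIB binders, is an assumption about the exact surface syntax.

```python
from typing import Dict, List, Union

SExpr = Union[str, List["SExpr"]]

FRAGMENT = """
(assume a0 (exists ((x A)) (f x)))
(anchor :step t3 :args (:= x vr))
(step t1 (cl (= x vr)) :rule refl)
(step t2 (cl (= (f x) (f vr))) :rule cong :premises (t1))
(step t3 (cl (= (exists ((x A)) (f x))
                (exists ((vr A)) (f vr)))) :rule bind)
"""


def parse_sexprs(text: str) -> List[SExpr]:
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse() -> SExpr:
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == ")":
            raise SyntaxError("unexpected ')'")
        if tok != "(":
            return tok
        items: List[SExpr] = []
        while pos < len(tokens) and tokens[pos] != ")":
            items.append(parse())
        if pos >= len(tokens):
            raise SyntaxError("missing ')'")
        pos += 1  # consume ')'
        return items

    result: List[SExpr] = []
    while pos < len(tokens):
        result.append(parse())
    return result


def step_table(commands: List[SExpr]) -> Dict[str, dict]:
    """Index the step commands by name with their rule and premises."""
    table: Dict[str, dict] = {}
    for cmd in commands:
        if cmd[0] != "step":
            continue
        attrs = cmd[3:]
        table[cmd[1]] = {
            "conclusion": cmd[2],
            "rule": attrs[attrs.index(":rule") + 1],
            "premises": attrs[attrs.index(":premises") + 1]
            if ":premises" in attrs else [],
        }
    return table


commands = parse_sexprs(FRAGMENT)
steps = step_table(commands)
```

The resulting table records, for instance, that t2 uses rule cong with premise t1, which is the information a reconstruction needs to replay the step.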

During proof search each module of veriT appends steps onto a list. Once 
the proof is completed, veriT performs some cleanup before printing the proof. 
First, a pruning phase removes branches of the proof not connected to the root $\bot$.
Second, a merge phase removes duplicated steps. The final pass prepares the 
data structures for the optional term sharing via name annotations. 
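
The pruning and merge phases can be pictured on a simple list of steps. The `Step` record, the rule names, and both functions below are hypothetical and serve only to illustrate the idea; they are not veriT's internal data structures. The sketch assumes that every step appears after its premises.

```python
from typing import Dict, List, NamedTuple, Tuple


class Step(NamedTuple):
    name: str
    conclusion: str
    rule: str
    premises: Tuple[str, ...]


def prune(steps: List[Step], root: str) -> List[Step]:
    """Keep only the steps reachable from the root through premises."""
    by_name = {s.name: s for s in steps}
    reachable, stack = set(), [root]
    while stack:
        name = stack.pop()
        if name not in reachable:
            reachable.add(name)
            stack.extend(by_name[name].premises)
    return [s for s in steps if s.name in reachable]


def merge_duplicates(steps: List[Step]) -> List[Step]:
    """Merge steps with equal conclusion, rule, and (renamed) premises."""
    rename: Dict[str, str] = {}
    seen: Dict[Tuple[str, str, Tuple[str, ...]], str] = {}
    merged: List[Step] = []
    for s in steps:
        premises = tuple(rename.get(p, p) for p in s.premises)
        key = (s.conclusion, s.rule, premises)
        if key in seen:
            rename[s.name] = seen[key]
        else:
            seen[key] = s.name
            merged.append(Step(s.name, s.conclusion, s.rule, premises))
    return merged


# Hypothetical proof: a1 is never used, and t2 duplicates t1.
proof = [
    Step("a0", "p", "assume", ()),
    Step("a1", "q", "assume", ()),
    Step("t1", "r", "rule_a", ("a0",)),
    Step("t2", "r", "rule_a", ("a0",)),
    Step("t3", "false", "rule_b", ("t1", "t2")),
]
cleaned = merge_duplicates(prune(proof, "t3"))
```

After both passes the unused assumption is gone and the root refers to the surviving copy of the duplicated step twice.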


3 Overview of the veriT-Powered smt Tactic 


Isabelle is a generic proof assistant based on an intuitionistic logic framework, 
Pure, and is almost always only used parameterized with a logic. In this work we 
use only Isabelle/HOL, the parameterization of Isabelle with higher-order logic 
with rank-1 (top level) polymorphism. Isabelle adheres to the LCF [26] tradition. 
Its kernel supports only a small number of inferences. Tactics are programs that 
prove a goal by using only the kernel for inferences. The LCF tradition also 
means that external tools, like SMT solvers, are not trusted. 

Nevertheless, external tools are successfully used. They provide relevant facts 
or a detailed proof. The Sledgehammer tool implements the former and passes 
the filtered facts to trusted tactics during preplay. The smt tactic implements 
the latter approach. The provided proof is checked by Isabelle. We focus on the 
smt tactic, but we also extended Sledgehammer to suggest our new tactic.

The smt tactic translates the current goal to the SMT-LIB format [11], runs
an SMT solver, parses the proof, and replays it through Isabelle’s kernel. To 
choose the smt tactic the user applies (smt (z3)) to use Z3 and (smt (verit)) 
to use veriT. We will refer to them as z-smt and v-smt. The proof formats of Z3 
and veriT are so different that separate reconstruction modules are needed. The 
v-smt tactic performs four steps: 


1. It negates the proof goal to have a refutation proof and also encodes the goal 
into first-order logic. The encoding eliminates lambda functions. To do so, it 
replaces each lambda function with a new function and creates app operators 
corresponding to function application. Then veriT is called to find a proof. 

2. It parses the proof found by veriT (if one is found) and encodes it as a 
directed acyclic graph with $\bot$ as the only conclusion.

3. It converts the SMT-LIB terms to typed Isabelle terms and also reverses the 
encoding used to convert higher-order into first-order terms. 

4. It traverses the proof graph, checks that all input assertions match their 
Isabelle counterpart and then reconstructs the proof step by step using the 
kernel’s primitives. 
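
The last step, traversing the proof graph and replaying it, can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the Step record, the two rule names (assume and resolution), and the string encoding of literals are hypothetical and far simpler than veriT's proof format or Isabelle's kernel interface. The empty clause plays the role of ⊥.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    rule: str          # hypothetical rule names: "assume" or "resolution"
    premises: tuple    # names of earlier steps
    clause: frozenset  # literals as strings; "~p" is the negation of "p"


def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit


def check_step(step, proved, assertions):
    """Check one step against the premises that were already replayed."""
    if step.rule == "assume":
        # Input assertions must match their counterpart on the prover side.
        return not step.premises and step.clause in assertions
    if step.rule == "resolution" and len(step.premises) == 2:
        left, right = (proved[p] for p in step.premises)
        return any(
            negate(lit) in right
            and step.clause == (left - {lit}) | (right - {negate(lit)})
            for lit in left
        )
    return False


def replay(proof, assertions):
    """Replay the steps in order; succeed only if the last clause is empty."""
    proved, last = {}, None
    for name, step in proof.items():
        if any(p not in proved for p in step.premises):
            return False  # premises must be replayed before they are used
        if not check_step(step, proved, assertions):
            return False
        proved[name] = step.clause
        last = step.clause
    return last == frozenset()
```

In this sketch a proof is an ordered dictionary of steps, so the acyclicity of the graph is reflected by every premise appearing before the step that uses it.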


4 Tuning the Reconstruction 


To improve the speed of the reconstruction method, we create small and well- 
defined rules for preprocessing simplifications (Sect. 4.1). Previously, veriT implic- 
itly normalized every step; e.g., repeated literals were immediately deleted. It now 
produces proofs for this transformation (Sect. 4.2). Finally, the linear-arithmetic 
steps contain coefficients which allow Isabelle to reconstruct the step without 
relying on its limited arithmetic automation (Sect. 4.3). On the Isabelle side, the 
reconstruction module selectively decodes the first-order encoding (Sect. 4.4). To 
improve the performance of the reconstruction, it skips some steps (Sect. 4.5). 


4.1 Preprocessing Rules 


During preprocessing SMT solvers perform simplifications on the operator level 
which are often akin to simple calculations; e.g., a × 0 × f(x) is replaced by 0. 

To capture such simplifications, we create a list of 17 new rules: one rule per 
arithmetic operator, one to replace boolean operators such as XOR with their 
definition, and one to replace n-ary operator applications with binary applica- 
tions. This is a compromise: having one rule for every possible simplification 
would create a longer proof. Since preprocessing uses structural recursion, the 
implementation simply picks the right rule in each leaf case. The example above 
now produces a prod_simplify step with the conclusion a × 0 × f(x) = 0. Previously, 
a single step of the connect_equiv rule collected all those simplifications 
and no list of simplifications performed by this rule existed. The reconstruction 
relied on an experimentally created list of tactics to be fast enough. 
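
To illustrate how structural recursion can pick one rule per operator and emit one proof step per simplification, consider the following Python sketch. It is an assumption-laden toy: the tuple encoding of terms, the RULES table, and the behavior given to sum_simplify are hypothetical, and only the prod_simplify example mirrors the text above.

```python
def prod_simplify(args):
    # A product that contains the constant 0 collapses to 0.
    return 0 if any(a == 0 for a in args) else None


def sum_simplify(args):
    # Hypothetical behavior: drop summands that are the constant 0.
    kept = [a for a in args if a != 0]
    if len(kept) == len(args):
        return None
    if not kept:
        return 0
    return kept[0] if len(kept) == 1 else ("+", *kept)


# One rule per operator; the recursion only has to look up the operator.
RULES = {"*": ("prod_simplify", prod_simplify), "+": ("sum_simplify", sum_simplify)}


def simplify(term, steps):
    """Terms are constants, variable names, or tuples (op, arg1, ..., argn)."""
    if not isinstance(term, tuple):
        return term
    op, *args = term
    args = [simplify(a, steps) for a in args]
    rebuilt = (op, *args)
    if op in RULES:
        name, rule = RULES[op]
        result = rule(args)
        if result is not None:
            # Record "rebuilt = result" as a step justified by the named rule.
            steps.append((name, rebuilt, result))
            return result
    return rebuilt
```

Calling simplify(("*", "a", 0, ("f", "x")), steps) returns 0 and records a single prod_simplify step, so a checker only has to handle one narrowly scoped rule at a time.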

On the Isabelle side, the reconstruction is fast, because we can direct the 
search instead of trying automated tactics that can also work on other parts of 
the formula. For example, the simplifier handles the numeral manipulations of 
the prod_simplify rule and we restrict it to only use arithmetic lemmas. 

Moreover, since we know the performed transformations, we can ignore some 
parts of the terms by generalizing, i.e., replacing them by constants [18]. Because 
generalized terms are smaller, the search is more directed and we are less likely 
to hit the search-depth limitation of Isabelle’s auto tactic as before. Overall, the 
reconstruction is more robust and easier to debug. 


4.2 Implicit Steps 


To simplify reconstruction, we avoid any implicit normal form of conclusions. For 
example, a rule concluding t ∨ P for any formula t can be used to prove P ∨ P. In 
such cases veriT automatically normalizes the conclusion P ∨ P to P. Without 
a proof of the normalization, the reconstruction has to handle such cases. 

We add new proof rules for the normalization and extend veriT to use them. 
Instead of keeping only the normalized step, both the original and the normalized 
step appear in the proof. For the example above, we have the step P ∨ P and 
the normalized P. To remove a double negation ¬¬t we introduce the tautology 
¬¬¬t ∨ t and resolve it with the original clause. Our changes do not affect any 
other part of veriT. The solver now also prunes steps concluding ⊤. 

On the Isabelle side, the reconstruction becomes more regular with fewer 
special cases and is more reliable. The reconstruction method can directly 
reconstruct rules. Previously, to deal with the normalization, the reconstruction first 
generated the conclusion of the theorem and then ran the simplifier to match the 
normalized conclusion. This could not deal with tautologies. 
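
The explicit normalization steps described above can be sketched in Python. This is a hypothetical model, not veriT's code: clauses are lists of strings, "~" marks negation, and the step labels (original, tautology, resolution, contraction) are names chosen for this illustration.

```python
def contract(clause):
    # Remove repeated literals while keeping the first occurrence.
    seen, out = set(), []
    for lit in clause:
        if lit not in seen:
            seen.add(lit)
            out.append(lit)
    return out


def resolve(left, right, pivot):
    # The pivot occurs in left and its negation "~" + pivot occurs in right.
    assert pivot in left and "~" + pivot in right
    return [l for l in left if l != pivot] + [l for l in right if l != "~" + pivot]


def explicit_normalization(clause):
    """Return the original step followed by explicit normalization steps."""
    steps = [("original", list(clause))]
    current = list(clause)
    for lit in clause:
        if lit.startswith("~~") and lit in current:
            t = lit[2:]
            tautology = ["~" + lit, t]  # the clause "not not not t, t"
            steps.append(("tautology", tautology))
            current = resolve(current, tautology, lit)
            steps.append(("resolution", list(current)))
    contracted = contract(current)
    if contracted != current:
        steps.append(("contraction", contracted))
    return steps
```

Because both the original and the normalized clause appear as separate steps, a checker never has to guess which normalization was applied.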

We also improve the proof reconstruction of quantifier instantiation steps. One 
of the instantiation schemes, conflicting instances [8,36], only works on clausified 
terms. We introduce an explicit quantified-clausification rule qnt_cnf issued 
before instantiating. While this rule is not detailed, knowing when clausification 
is needed improves reconstruction, because it avoids clausifying unconditionally. 
The clausification is also shared between instantiations of the same term. 


4.3 Arithmetic Reasoning 


We use a proof witness to handle linear arithmetic. When the propositional 
model is unsatisfiable in the theory of linear real arithmetic, the solver creates 
la_generic steps. The conclusion is a tautological clause of linear inequalities 
and equations and the justification of the step is a list of coefficients so that 
the linear combination is a trivially contradictory inequality after simplification 
(e.g., 0 > 1). Farkas’ lemma guarantees the existence of such coefficients for reals. 
Most SMT solvers, including veriT, use the simplex method [21] to handle linear 
arithmetic. It calculates the coefficients during normal operation. 

The real arithmetic solver also strengthens inequalities on integer variables 
before adding them to the simplex method. For example, if x is an integer the 
inequality 2x < 3 becomes x ≤ 1. The corresponding justification is the rational 
coefficient 1/2. The reconstruction must replay this strengthening. 

The complete linear arithmetic proof step 1 < x ∨ 2x < 3 looks like 


(step t11 (cl (< 1 x) (< (* 2 x) 3)) 
    :rule la_generic :args (1 (div 1 2))) 
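
To see why the coefficients make the step mechanically checkable, the following Python sketch verifies the linear combination for this example over the rationals. The constraint encoding and the function names are assumptions made for illustration; the sketch handles only inequalities and does not model the integer strengthening or the special treatment of equalities.

```python
from fractions import Fraction


def combine(constraints, coefficients):
    """Sum the constraints scaled by positive coefficients.

    Each constraint is (coeffs, const, strict) and stands for
    sum(coeffs[v] * v) + const <= 0, or < 0 when strict is True.
    """
    total, const, strict = {}, Fraction(0), False
    for (coeffs, c, s), k in zip(constraints, coefficients):
        assert k > 0
        for var, a in coeffs.items():
            total[var] = total.get(var, Fraction(0)) + k * a
        const += k * c
        strict = strict or s
    return total, const, strict


def is_contradiction(total, const, strict):
    # All variables must cancel, leaving a false statement about constants.
    if any(a != 0 for a in total.values()):
        return False
    return const >= 0 if strict else const > 0


# Negated literals of the clause (1 < x) or (2x < 3):
#   not (1 < x)   is  x - 1 <= 0
#   not (2x < 3)  is  3 - 2x <= 0
negated = [
    ({"x": Fraction(1)}, Fraction(-1), False),
    ({"x": Fraction(-2)}, Fraction(3), False),
]
certificate = [Fraction(1), Fraction(1, 2)]
```

With the coefficients 1 and 1/2 the variable x cancels and the combination reads 1/2 ≤ 0, which is trivially false, so the clause is a tautology.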


The reconstruction of an la_generic step in Isabelle starts with the goal 
⋁ᵢ ¬cᵢ where each cᵢ is either an equality or an inequality. The reconstruction 
method first generalizes over the non-arithmetic parts. Then it transforms the 
lemma into the equivalent formulation c₁ ⟹ ⋯ ⟹ cₙ ⟹ ⊥ and removes all 
negations (e.g., by replacing ¬(a < b) with b ≤ a). 

Next, the reconstruction method multiplies each (in)equality by the corresponding 
coefficient. For example, for integers, the inequality A < B, and the coefficient p/q 
(with p > 0 and q > 0), it strengthens the inequality and multiplies by p to get 


p × (A div q) + p × (if B mod q = 0 then 1 else 0) ≤ p × (B div q). 


The if-then-else term (if B mod q = 0 then 1 else 0) corresponds to the strengthening. 
If B mod q = 0, the result is an inequality of the form A′ + 1 ≤ B′, i.e., 
A′ < B′. No strengthening is required for the corresponding theorem over reals. 

Finally, we can combine all the (in)equalities by summing them while being 
careful with the equalities that can appear. We simplify the resulting (in)equality 
using Isabelle’s simplifier to derive ⊥. 
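
As a sanity check of the strengthened inequality, the following Python sketch tests it exhaustively on a small range of integers. It assumes that Python's floor division and modulo agree with div and mod on integers for positive q; the function names and the test bound are choices made for this illustration.

```python
def strengthened(a, b, p, q):
    """Check p*(a div q) + p*(if b mod q = 0 then 1 else 0) <= p*(b div q)."""
    bump = 1 if b % q == 0 else 0
    return p * (a // q) + p * bump <= p * (b // q)


def holds_everywhere(bound=20):
    # The premise is a < b over the integers, with p > 0 and q > 0.
    return all(
        strengthened(a, b, p, q)
        for a in range(-bound, bound)
        for b in range(a + 1, bound + 1)
        for p in range(1, 5)
        for q in range(1, 5)
    )
```

The case a = 0, b = 1, q = 2 shows why the conclusion uses ≤ rather than <: both sides evaluate to 0.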

To replay linear arithmetic steps, Isabelle can also use the tactic linarith as 
used for Z3 proofs. It searches for the coefficients necessary to verify the lemma. 
The reconstruction used it previously [25], but the tactic can only find integer 
coefficients and fails if strengthening is required. Now the rule is a mechanically 
checkable certificate. 


4.4 Selective Decoding of the First-order Encoding 


Next, we consider an example of a rule that shows the interplay of the higher-order 
encoding and the reconstruction. To express function application, the encoding 
introduces the first-order function app and constants for encoded functions. The 
proof rule eq_congruent expresses congruence on a first-order function: (t₁ ≠ 
u₁) ∨ … ∨ (tₙ ≠ uₙ) ∨ f(t₁, …, tₙ) = f(u₁, …, uₙ). With the encoding it can 
conclude f ≠ f′ ∨ x ≠ x′ ∨ app(f, x) = app(f′, x′). If the reconstruction unfolds 
the entire encoding, it builds the term f ≠ f′ ∨ x ≠ x′ ∨ f x = f′ x′. It then identifies 
the functions and the function arguments and uses rewriting to prove that if 
f = f′ and x = x′, then f x = f′ x′. 

However, Isabelle β-reduces all terms implicitly, changing the term structure. 
Assume f := λx. x = a and f′ := λx. a = x. After unfolding all constructs that 
encode higher-order terms and after β-reduction, we get (λx. x = a) ≠ (λx. a = 
x) ∨ (x ≠ x′) ∨ (x = a) = (a = x′). The reconstruction method cannot identify 
the functions and function arguments anymore. 
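
The loss of structure can be made concrete with a small Python sketch. The tuple encoding of terms and the function is_eq_congruent_instance are hypothetical; the check simply tests whether the hypotheses of a clause are the argument-wise disequalities of its concluding equation.

```python
def is_eq_congruent_instance(clause):
    """Check the shape t1 != u1, ..., tn != un, g(t1..tn) = g(u1..un)."""
    if not clause:
        return False
    *hyps, concl = clause
    if concl[0] != "=":
        return False
    lhs, rhs = concl[1], concl[2]
    if not (isinstance(lhs, tuple) and isinstance(rhs, tuple)):
        return False
    if lhs[0] != rhs[0] or len(lhs) != len(rhs):
        return False
    expected = [("!=", t, u) for t, u in zip(lhs[1:], rhs[1:])]
    return hyps == expected


# With the encoding kept: f != f', x != x', app(f, x) = app(f', x').
encoded = [
    ("!=", "f", "f'"),
    ("!=", "x", "x'"),
    ("=", ("app", "f", "x"), ("app", "f'", "x'")),
]

# After unfolding and beta-reduction with f := (lam x. x = a), f' := (lam x. a = x).
lam1 = ("lam", "x", ("=", "x", "a"))
lam2 = ("lam", "x", ("=", "a", "x"))
reduced = [
    ("!=", lam1, lam2),
    ("!=", "x", "x'"),
    ("=", ("=", "x", "a"), ("=", "a", "x'")),
]
```

In this model the check accepts encoded but rejects reduced, because the heads of the two sides are now equalities whose arguments no longer correspond to the hypotheses.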

Instead, the reconstruction method does not unfold the encoding including 
app. This eliminates the need for a special case to detect lambda functions. Such 
a case was used in the previous prototype, but the code was very involved and 
hard to test (such steps are rarely used). 


4.5 Skipping Steps 


The increased number of steps in the fine-grained proof format slows down 
reconstruction. For example, consider skolemization from ∃x. P x. The proof from Z3 
uses one step. veriT uses eight steps: first renaming it to (∃x. P x) = (∃v. P v) 
(with a subproof of at least 2 steps), then concluding the renaming to get (∃v. P v) 
(two steps), then (∃v. P v) = P (εv. P v) (with a subproof of at least 2 steps), 
and finally P (εv. P v) (two steps). 

To reduce the number of steps, our reconstruction skips two kinds of steps. 
First, it replaces every usage of the or rule by its only premise. Second, it skips 
the renaming of bound variables. The proof format treats ∀x. P x and ∀y. P y 
as two different terms and requires a detailed proof of the conversion. Isabelle, 
however, uses De Bruijn indices and variable names are irrelevant. Hence, we 
replace steps of the form (∀x. P x) = (∀y. P y) by a single application of 
reflexivity. Since veriT canonizes all variable names, this eliminates many steps. 
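
The reason a renaming step collapses to reflexivity can be shown with a minimal De Bruijn conversion in Python. The term encoding (forall, app, and variable names as strings) is an assumption for this sketch and is unrelated to Isabelle's internal term representation.

```python
def to_de_bruijn(term, bound=()):
    """Convert named binders to De Bruijn indices.

    Terms are ("forall", var, body), ("app", fun, arg), or a name (str).
    """
    if isinstance(term, str):
        # Bound variables become the distance to their binder (innermost is 0).
        return bound.index(term) if term in bound else term
    if term[0] == "forall":
        _, var, body = term
        return ("forall", to_de_bruijn(body, (var,) + bound))
    if term[0] == "app":
        _, fun, arg = term
        return ("app", to_de_bruijn(fun, bound), to_de_bruijn(arg, bound))
    raise ValueError(f"unknown term: {term!r}")
```

Both ∀x. P x and ∀y. P y convert to ("forall", ("app", "P", 0)), so the two sides of the renaming equation are syntactically identical in this representation.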

We can also simplify the idiom “equiv_pos2; th_resolution”. veriT gener- 
ates it for each skolemization and variable renaming. Step skipping replaces it 
by a single step which we replay using a specialized theorem. 

On proofs with quantifiers, step skipping can remove more than half of the 
steps: only four steps remain in the skolemization example above (two of which 
are simply reflexivity). However, with step skipping the smt method is not an 
independent checker that confirms the validity of every single step in a proof. 


5 Evaluation 


During development we routinely tested our proof reconstruction to find bugs. As 
a side effect, we produced SMT-LIB files corresponding to the calls. We measure 
the performance of veriT with various options on them and select five different 
strategies (Sect. 5.1). We also evaluate the distribution of the tactics used by 
Sledgehammer for preplay (Sect. 5.2), and the impact of the rules (Sect. 5.3). 

We performed the strategy selection on a computer with two Intel Xeon 
Gold 6130 CPUs (32 cores, 64 threads) and 192 GiB of RAM. We performed 
Isabelle experiments with Isabelle version 2021 on a computer with two AMD 
EPYC 7702 CPUs (128 cores, 256 threads) and 2 TiB of RAM. 


5.1 Strategies 


veriT exposes a wide range of options to fine-tune the proof search. In order 
to find good combinations of options (strategies), we generate problems with 
Sledgehammer and use them to fine-tune veriT’s search behavior. Generating 
problems also makes it possible to test and debug our reconstruction. 

We test the reconstruction by using Isabelle’s Mirabelle tool. It reads theories 
and automatically runs Sledgehammer [14] on all proof steps. Sledgehammer 
calls various automatic provers (here the SMT solvers CVC4, veriT, and Z3 and 
the superposition prover E [38]) to filter facts and chooses the fastest tactic that 
can prove the goal. The tactic smt is used as a last resort. 


Table 1. Options corresponding to the different veriT strategies 


Name        Options 

default     (no option) 

del_insts   --index-sorts --index-fresh-sorts --ccfv-breadth --inst-deletion 
            --index-SAT-triggers --inst-deletion-loops --inst-deletion-track-vars 

ccfv_SIG    --triggers-new --index-SIG --triggers-sel-rm-specific 

ccfv_insts  --triggers-new --index-sorts --index-fresh-sorts --triggers-sel-rm-specific 
            --triggers-restrict-combine --inst-deletion-loops --index-SAT-triggers 
            --inst-deletion-track-vars --ccfv-index=100000 --ccfv-index-full=1000 
            --inst-sorts-threshold=100000 --ematch-exp=10000000 --inst-deletion 

best        --triggers-new --index-sorts --index-fresh-sorts --triggers-sel-rm-specific 


To generate problems for tuning veriT, we use the theories from HOL-Library 
(an extended standard library containing various developments) and from the 
formalizations of Green’s theorem [2,3], the Prime Number Theorem [23], and 
the KBO ordering [13]. We call Mirabelle with only veriT as a fact filter. This 
produces SMT files for representative problems Isabelle users want to solve and 
a series of calls to v-smt. For failing v-smt calls three cases are possible: veriT 
does not find a proof, reconstruction times out, or reconstruction fails with an 
error. We solved all reconstruction failures in the test theories. 

To find good strategies, we determine which problems are solved by several 
combinations of options within a two-second timeout. We then choose the strategy 
which solves the most benchmarks and three strategies which together solve the 
most benchmarks. For comparison, we also keep the default strategy. 
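
The selection criterion can be sketched in Python as follows. The function select and the data layout are hypothetical, and the sketch assumes that the complementary strategies are chosen among all candidates by exhaustive search over combinations, which the text does not specify.

```python
from itertools import combinations


def select(solved, k=3):
    """solved maps a strategy name to the set of benchmarks it solves.

    Returns the single best strategy and the k strategies with the
    largest union of solved benchmarks.
    """
    names = sorted(solved)  # sorted for a deterministic tie-break
    best_single = max(names, key=lambda s: len(solved[s]))
    best_combo = max(
        combinations(names, k),
        key=lambda combo: len(set().union(*(solved[n] for n in combo))),
    )
    return best_single, best_combo
```

For a handful of candidate option sets the exhaustive search is cheap; with many candidates a greedy set-cover heuristic would be a natural substitute.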

The strategies are shown in Table 1 and mostly differ in the instantiation 
schemes. The strategy del_insts uses instance deletion [6] and uses a breadth- 
first algorithm to find conflicting instances. All other strategies rely on extended 
trigger inference [29]. The strategy ccfv_SIG uses a different indexing method for 
instantiation. It also restricts enumerative instantiation [35], because the options 
--index-sorts and --index-fresh-sorts are not used. The strategy ccfv_insts increases 
some thresholds. Finally, the strategy best uses a subset of the options used by 
the other strategies. Sledgehammer uses best for fact filtering. 

We have also considered using a scheduler in Isabelle as used in the SMT 
competition. The advantage is that we do not need to select the strategy on 
the Isabelle side. However, it would make v-smt unreliable. A problem solved by 
only one strategy just before the end of its time slice can become unprovable on 
slower hardware. Issues with z-smt timeouts have been reported on the Isabelle 
mailing list, e.g., due to an antivirus delaying the startup [27]. 


5.2 Improvements of Sledgehammer Results 


To measure the performance of the v-smt tactic, we ran Mirabelle on the full HOL- 
Library, the theory Prime Distribution Elementary (PDE) [22], an executable 
resolution prover (RP) [37], and the Simplex algorithm [30]. We extended 
Sledgehammer’s proof preplay to try all veriT strategies and added instrumentation for 
the time of all tried tactics. Sledgehammer and automatic provers are mostly 
nondeterministic programs. To reduce the variance between the different Mirabelle 
runs, we use the deterministic MePo fact filter [33] instead of the better performing 
MaSh [28] that uses machine learning (and depends on previous runs) and 
underuse the hardware to minimize contention. We use the default timeouts of 
30 seconds for the fact filtering and one second for the proof preplay. This is 
similar to the Judgment Day experiments [17]. The raw results are available [1]. 


Table 2. Outcome of Sledgehammer calls showing the total success rate (SR, higher is 
better) of one-liner proof preplay, the number of suggested v-smt (OL_v) and z-smt (OL_z) 
one-liners, and the number of preplay failures (PF, lower is better), in percentages of 
the unique goals. 


          HOL-Library              PNT                      RP                       Simplex 
          (13 562 goals)           (1715 goals)             (1658 goals)             (1982 goals) 
          SR    OL_v  OL_z  PF     SR    OL_v  OL_z  PF     SR    OL_v  OL_z  PF     SR    OL_v  OL_z  PF 

Fact-filter prover: CVC4 
  z-smt   54.5  -     2.7   1.5    33.1  -     3.7   0.8    64.8  -     1.3   0.8    51.6  -     1.6   0.9 
  both    55.5  2.5   1.1   0.5    33.6  3.6   0.6   0.3    65.3  1.4   0.4   0.3    52.1  1.1   1.0   0.4 
Fact-filter prover: E 
  z-smt   55.5  -     1.1   1.7    36.0  -     0.3   1.7    61.7  -     0.7   1.2    49.8  -     1.4   0.7 
  both    56.0  0.8   0.7   1.3    36.4  0.6   0.1   1.3    62.1  0.9   0.2   0.8    49.9  0.3   1.3   0.5 
Fact-filter prover: veriT 
  z-smt   48.5  -     1.7   1.2    26.1  -     1.5   0.5    58.2  -     0.9   0.7    46.7  -     0.9   1.0 
  both    49.4  1.6   0.9   0.4    26.5  1.4   0.4   0.2    58.6  1.1   0.3   0.2    47.4  1.0   0.6   0.3 
Fact-filter prover: Z3 
  z-smt   50.8  -     2.5   0.8    27.9  -     2.7   0.4    60.4  -     0.8   0.7    48.3  -     0.9   0.3 
  both    51.3  1.9   1.1   0.3    28.2  2.5   0.5   0.1    60.9  1.1   0.1   0.2    48.4  0.4   0.6   0.2 


Success Rate. Users are not interested in which tactics are used to prove a goal, 
but in how often Sledgehammer succeeds. There are three possible outcomes: 
(i) a successfully preplayed proof, (ii) a proof hint that failed to be preplayed 
(usually because of a timeout), or (iii) no proof. We define the success rate as 
the proportion of outcome (i) over the total number of Sledgehammer calls. 

Table 2 gathers the results of running Sledgehammer on all unique goals and 
analyzing its outcome using different preplay configurations where only z-smt 
(the baseline) or both v-smt and z-smt are enabled. Any useful preplay tactic 
should increase the success rate (SR) by preplaying new proof hints provided by 
the fact-filter prover, reducing the preplay failure rate (PF). 

Let us consider, e.g., the results when using CVC4 as fact-filter prover. The 
success rate of the baseline on the HOL-Library is 54.5% and its preplay failure 
rate is 1.5%. This means that CVC4 found a proof for 54.5% + 1.5% = 56% of the 
goals, but that Isabelle’s proof methods failed to preplay many of them. In such 
cases, Sledgehammer gives a proof hint to the user, who has to manually find a 
functioning proof. By enabling v-smt, the failure rate decreases by two thirds, from 
1.5% to 0.5%, which directly increases the success rate by 1 percentage point: new 
cases where the burden of the proof is moved from the user to the proof assistant. 
The failure rate is reduced in similar proportions for PNT (63%), RP (63%), and 
Simplex (56%). For these formalizations, this improvement translates to a smaller 
increase of the success rate, because the baseline failure rate was smaller to begin 
with. This confirms that the instantiation technique conflicting instances [8,36] 
is important for CVC4. 
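
The percentages in this paragraph follow directly from the preplay failure rates reported in Table 2. The following Python snippet, written only as a worked check with variable names chosen for illustration, recomputes them.

```python
def relative_reduction(before, after):
    """Fraction of the failures that disappear when going from before to after."""
    return (before - after) / before


# Preplay failure rates (PF, in % of the goals) with CVC4 as fact filter,
# copied from Table 2 as pairs (z-smt only, both tactics).
failure_rates = {
    "HOL-Library": (1.5, 0.5),
    "PNT": (0.8, 0.3),
    "RP": (0.8, 0.3),
    "Simplex": (0.9, 0.4),
}

reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in failure_rates.items()
}

# Share of HOL-Library goals for which CVC4 found a proof in the baseline:
# successfully preplayed proofs plus preplay failures.
proofs_found = 54.5 + 1.5

for name, value in reductions.items():
    print(f"{name}: failure rate reduced by {value:.1%}")
```

The reductions are two thirds for HOL-Library, 62.5% for PNT and RP, and roughly 56% for Simplex, matching the rounded figures in the text.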

When using veriT or Z3 as fact-filter prover, a failure rate of zero could 
be expected, since the same SMT solvers are used for both fact filtering and 
preplaying. The observed failure rate can partly be explained by the much smaller 
timeout for preplay (1 second) than for fact filtering (30 seconds). 

Overall, these results show that our proof reconstruction enables Sledgeham- 
mer to successfully preplay more proofs. With v-smt enabled, the weighted average 
failure rate decreases as follows: for CVC4, from 1.3% to 0.4%; for E, from 1.5% 
to 1.2%; for veriT, from 1.0% to 0.3%; and for Z3, from 0.7% to 0.3%. For the 
user, this means that the availability of v-smt as a proof preplay tactic increases 
the number of goals that can be fully automatically proved. 


Saved time. Table 3 shows a different view on the same results. Instead of the 
raw success rate, it shows the time that is spent reconstructing proofs. Using 
the baseline configuration, preplaying all formalizations takes a total of 250.1 + 
33.4 + 37.2 + 42.8 = 363.5 seconds. When enabling v-smt, some calls to z-smt 
are replaced by faster v-smt calls and the reconstruction time decreases by 13% 
to 212.6 + 28.4 + 34.4 + 41.6 = 317 seconds. Note that the per-formalization 
improvement varies considerably: 15% for HOL-Library, 15% for PNT, 7.5% for 
RP, and 4.0% for Simplex. 
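
The totals can be recomputed from the per-formalization times in Table 3. The snippet below is only a worked check of the two sums and of the overall saving; the variable names are illustrative.

```python
# Total preplay time in seconds per formalization (Table 3, CVC4 as fact filter).
baseline = {"HOL-Library": 250.1, "PNT": 33.4, "RP": 37.2, "Simplex": 42.8}
with_vsmt = {"HOL-Library": 212.6, "PNT": 28.4, "RP": 34.4, "Simplex": 41.6}

total_before = sum(baseline.values())
total_after = sum(with_vsmt.values())
saving = (total_before - total_after) / total_before

print(f"{total_before:.1f} s -> {total_after:.1f} s, saving {saving:.1%}")
```

The overall saving is 46.5 seconds out of 363.5 seconds, i.e., about 13%.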

For the user, this means that enabling v-smt as a proof preplay tactic may 
significantly reduce the verification time of their formalizations. 


Impact of the Strategies. We have also studied what happens if we remove a 
single veriT strategy from Sledgehammer (Table 4). The most important one 
is best, as it solves the highest number of problems. In contrast, default is 
nearly entirely covered by the other strategies. ccfv_SIG and del_insts are faster 
than Z3 on a similar number of goals, but the latter has more unique goals and 
therefore saves more time. Each strategy has some uniquely solved 
problems that cannot be reconstructed using any other. The results are similar 
for the other theories used in Table 3. 


5.3 Speed of Reconstruction 


To better understand what the key rules of our reconstruction are, we recorded the 
time used to reconstruct each rule and the time required by the solver over all calls 
attempted by Sledgehammer including the ones not selected. The reconstruction 
ratio (reconstruction over search time) shows how much slower reconstructing 
a proof is compared to finding it. For 25% of the proofs, Z3’s concise format 
is better and the reconstruction is faster than proof finding (first quartile: 0.9 
for v-smt vs. 0.1 for z-smt). The 99th percentile of the proofs (18.6 vs. 27.2) 
shows that veriT’s detailed proof format reduces the number of slow proofs. The 
reconstruction is slower than finding proofs on average for both solvers. 


Table 3. Preplayed proofs (Pr.) and their execution time (s) when using CVC4 as 
fact-filter prover. Shared proofs are found with and without v-smt and new proofs 
are found only with v-smt. The proofs and their associated timings are categorized in 
one-liners using v-smt (OL_v), z-smt (OL_z), or any other Isabelle proof methods (OL_o). 


                      Total             Shared proofs                                        New proofs 
                                        OL_v             OL_z             OL_o               OL_v 
                      Pr.    Time       Time (Pr.)       Time (Pr.)       Time (Pr.)         Time (Pr.) 

HOL-Library   z-smt   7409   250.1  =   -            +   85.0 (362)   +   165.1 (7047)       - 
              both    7545   212.6  =   27.9 (211)   +   19.6 (152)   +   165.1 (7047)       34.7 (135) 
PNT           z-smt    569    33.4  =   -            +   14.8 (64)    +   18.5 (505)         - 
              both     577    28.4  =   7.7 (54)     +   2.1 (10)     +   18.5 (505)         3.4 (8) 
RP            z-smt   1077    37.2  =   -            +   8.7 (22)     +   28.5 (1055)        - 
              both    1085    34.4  =   4.5 (16)     +   1.4 (6)      +   28.5 (1055)        2.2 (8) 
Simplex       z-smt   1024    42.8  =   -            +   6.7 (32)     +   36.0 (992)         - 
              both    1033    41.6  =   2.4 (13)     +   3.2 (19)     +   36.0 (992)         3.0 (9) 


Table 4. Reconstruction time and number of solved goals when removing a single 
strategy (HOL-Library results only), using CVC4 as fact filter. 


                     Shared proofs                       New proofs 
                     OL_v              OL_z              OL_v 
                     Time   Proofs     Time   Proofs     Time   Proofs 

No best              16.5   119        50.6   244        25.9   94 
No ccfv_SIG          27.0   198        22.6   164        33.5   123 
No ccfv_threshold    28.3   211        19.6   152        33.9   130 
No del_insts         27.4   201        21.8   162        32.9   124 
No default           27.9   207        20.1   156        33.8   134 
Baseline             27.9   211        19.6   152        34.7   135 

Fig. 1 shows the distribution of the time spent on some rules. We remove the 
slowest and fastest 5% of the applications, because garbage collection can trigger 
at any moment and even trivial rules can be slow. Fig. 2 gives the sum of all 
reconstruction times over all proofs. We call parsing the time required to parse 
and convert the veriT proof into Isabelle terms. 
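
The two measures used in this section, the reconstruction ratio and the trimming of the slowest and fastest 5% of rule applications, can be written down in a few lines of Python. This is a sketch of the definitions only; the function names and the rounding of the cut-off are assumptions, and no measurement data from the experiments is reproduced here.

```python
def reconstruction_ratio(reconstruction_time, search_time):
    """Ratio above 1 means replaying the proof is slower than finding it."""
    return reconstruction_time / search_time


def trimmed(times, percent=5):
    """Drop the fastest and the slowest `percent` percent of the timings."""
    ordered = sorted(times)
    cut = len(ordered) * percent // 100
    return ordered[cut:len(ordered) - cut]
```

Trimming both tails removes outliers such as applications that happen to coincide with a garbage collection, at the cost of hiding genuinely slow cases, which is why the sum over all applications is reported separately.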

Overall, there are two kinds of rules: (1) direct application of a sequence of 
theorems (e.g., equiv_pos2 corresponds to the theorem ¬(a = b) ∨ ¬a ∨ b) 
and (2) calls to full-blown tactics, like qnt_cnf (Sect. 4.2). 

First, direct applications of theorems are usually fast, but they occur so often 
that the cumulative time is substantial. For example, cong only needs to unfold 
assumptions and apply reflexivity and symmetry of equality. However, it appears 
so often, and sometimes on large terms, that it is an important rule. 

Second, rules which require full-blown tactics are the slowest rules. For qnt_cnf 
(CNF under quantifiers, see Sect. 4.2), we have not written a specialized 
tactic, but rely on Isabelle’s tableau-based blast tactic. This rule is rather slow, 
but is rarely used. It is similar to the rule la_generic: it is slow on average, but 
searching the coefficients takes even more time. 

We can also see that the time required to check the simplification steps that 
were formerly combined into the connect_equiv rule is not significant anymore. 

We have performed the same experiments with the reconstruction of the SMT 
solver Z3. In contrast to veriT, we do not have the amount of time required for 
parsing. The results are shown in Figs. 3 and 4. The rule distribution is very 
different. The nnf-neg and nnf-pos rules are the slowest rules and take a huge 
amount of time in the worst case. However, the coarser quantifier instantiation 
step is on average faster than the one produced by veriT. We suspect that 
reconstruction is faster because the rule, which is only an implication without 
choice terms, is easier to check (no equality reordering). 


6 Related Work 


The SMT solvers CVC4 [10], Z3 [34], and veriT [19] produce proofs. CVC4 
does not record quantifier reasoning in the proof, and Z3 uses some macro rules. 
Proofs from SMT solvers have also been used to find unsatisfiability cores [20], 
and interpolants [32]. They are also useful to debug the solver itself, since unsound 
steps often point to the origin of bugs. Our work also relates to systems like 
Dedukti [5] that focus on translating proof steps, not on replaying them. 

Proof reconstruction has been implemented in various systems, including 
CVC4 proofs in HOL Light [31], Z3 in HOL4 and Isabelle/HOL [18], and veriT [4] 
and CVC4 [24] in Coq. Only veriT produces detailed proofs for preprocessing and 
skolemization. SMTCoq [4,24] currently supports veriT’s version 1 of the proof 
output, which has different rules, does not support detailed skolemization rules, 
and is implemented in the 2016 version of veriT, which has worse performance. 
SMTCoq also supports bit vectors and arrays. 

The reconstruction of Z3 proofs in HOL4 and Isabelle/HOL is one of the 
most advanced and well tested. It is regularly used by Isabelle users. The Z3 
proof reconstruction succeeds in more than 90% of Sledgehammer benchmarks [14, 
Section 9] and is efficient (an older version of Z3 was used). Performance numbers 
are reported [16,18] not only for problems generated by proof assistants (including 
Isabelle), but also for preexisting SMT-LIB files from the SMT-LIB library. 

The performance study by Böhme [16, Sect. 3.4] uses version 2.15 of Z3, 
whereas we use version 4.4.0 which currently ships with Isabelle. Since version 
2.15, the proof format changed slightly (e.g., th-lemma-arith was introduced), 
fulfilling some of the wishes expressed by Böhme and Weber [18] to simplify 
reconstruction. Surprisingly, the nnf rules do not appear among the five rules 
that used the most runtime. Instead, the th-lemma and rewrite rules were the 
slowest. Similarly to veriT, the cong rule was among the most used (without 
accounting for the most time), but it does not appear in our Z3 tests. 


[Figure not reproduced.] 
Fig. 1. Timing, sorted by the median, of a subset of veriT’s rules. From left to right, 
the lower whisker marks the 5th percentile, the lower box line the first quartile, the 
middle of the box the median, the upper box line the third quartile, and the upper 
whisker the 95th percentile. 


[Figure not reproduced; axis: percentage of time.] 
Fig. 2. Total percentage spent on each rule for the SMT solver veriT in the same order 
as Fig. 1. This graph maps the rules already shown in Fig. 1 to the total amount of 
time. The slowest rules are th_resolution (14.7%), parsing (10.3%), and cong (9.77%). 


[Figure not reproduced; axis: time (ms), lower is better.] 
Fig. 3. Timing of some of Z3’s rules sorted by median. From left to right, the lower 
whisker marks the 5th percentile, the lower box line the first quartile, the middle of the 
box the median, the upper box line the third quartile, and the upper whisker the 95th 
percentile. nnf-neg’s 95th percentile is 87 ms, nnf-pos’s is 33 ms, and parsing’s is 25 ms. 


[Figure not reproduced.] 
Fig. 4. Total amount of time per rule for the SMT solver Z3. 

CVC4 follows a different philosophy compared to veriT and Z3: it produces 
proofs in a logical framework with side conditions [39]. The output can contain 
programs to check certain rules. The proof format is flexible in some aspects and 
restrictive in others. Currently CVC4 does not generate proofs for quantifiers. 


7 Conclusion 


We presented an efficient reconstruction of proofs generated by a modern SMT 
solver in an interactive theorem prover. Our improvements address reconstruction 
challenges for proof steps of typical inferences performed by SMT solvers. 


By studying the time required to replay each rule, we were able to compare the 
reconstruction for two different proof formats with different design directions. The 
very detailed proof format of veriT makes the reconstruction easier to implement 
and allows for more specialization of the tactics. On slow proofs, the ratio of time 
to reconstruct and time to find a proof is better for our more detailed format. 
Integrating our reconstruction in Isabelle halves the number of failures from 
Sledgehammer and nicely completes the existing reconstruction method with Z3. 


Our work is integrated into Isabelle version 2021. Sledgehammer suggests the 
veriT-based reconstruction if it is the fastest tactic that finds the proof; so users 
profit without action required on their side. We plan to improve the reconstruction 
of the slowest rules and remove inconsistencies in the proof format. The developers 
of the SMT solver CVC4 are currently rewriting the proof generation and plan 
to support a similar proof format. We hope to be able to reuse the current 
reconstruction code by only adding support for CVC4-specific rules. Generating 
and reconstructing proofs from the veriT version with higher-order logic [9] 
could also improve the usefulness of veriT on Isabelle problems. The current 
proof rules [40] should accommodate the more expressive logic. 


Abstract. We explore the Collatz conjecture and its variants through the lens of 
termination of string rewriting. We construct a rewriting system that simulates 
the iterated application of the Collatz function on strings corresponding to mixed 
binary–ternary representations of positive integers. Termination of this rewriting 
system is equivalent to the Collatz conjecture. To show the feasibility of our ap- 
proach in proving mathematically interesting statements, we implement a minimal 
termination prover that uses the automated method of matrix/arctic interpretations 
and we perform experiments where we obtain proofs of nontrivial weakenings of 
the Collatz conjecture. Finally, we adapt our rewriting system to show that other 
open problems in mathematics can also be approached as termination problems 
for relatively small rewriting systems. Although we do not succeed in proving 
the Collatz conjecture, we believe that the ideas here represent an interesting new 
approach. 


1 Introduction 


Let ℕ = {0, 1, 2, …} denote the natural numbers and ℕ⁺ = {1, 2, 3, …} denote the 
positive integers. We define the Collatz function C : ℕ⁺ → ℕ⁺ as 


           ⎧ n/2      if n ≡ 0 (mod 2) 
    C(n) = ⎨ 
           ⎩ 3n + 1   if n ≡ 1 (mod 2). 


Given a function f and a number k ∈ ℕ, the function fᵏ denotes the kth iterate of f. 
The well-known Collatz conjecture is the following: 


Conjecture 1. For all n ∈ ℕ⁺, there exists some k ∈ ℕ such that Cᵏ(n) = 1. 


This is a longstanding open problem and there is a vast literature dedicated to its study. 
For its history, we refer the reader to the comprehensive surveys by Lagarias [17-19]. 
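
To make Conjecture 1 concrete, here is a minimal Python sketch of $C$ and of a bounded search for the exponent $k$. The function names and the `max_steps` cutoff are assumptions made for this illustration; a `None` result only means the cutoff was reached, not that the trajectory diverges.

```python
from typing import Optional


def collatz(n: int) -> int:
    """One application of the Collatz function C to a positive integer."""
    if n < 1:
        raise ValueError("C is defined on positive integers only")
    return n // 2 if n % 2 == 0 else 3 * n + 1


def steps_to_one(n: int, max_steps: int = 10_000) -> Optional[int]:
    """Least k with C^k(n) = 1, or None if none is found within max_steps."""
    for k in range(max_steps + 1):
        if n == 1:
            return k
        n = collatz(n)
    return None


print(steps_to_one(6))   # 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1
```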


Definition 1 (Convergent function). Consider a function $f \colon X \to X$. Given $x \in X$,
the sequence of iterates $f^{*}(x) := (x, f(x), f^2(x), \dots)$ is called the $f$-trajectory of $x$.
For some designated element $z \in X$, if for all $x \in X$ the trajectory $f^{*}(x)$ contains $z$,
the function $f$ is called convergent.


* The full version is available at https://www.cs.cmu.edu/~eyolcu/research/rewriting-collatz.pdf. 


© The Author(s) 2021 
A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 468-484, 2021. 
https://doi.org/10.1007/978-3-030-79876-5_27 

In this paper, we describe an approach based on termination of string rewriting to
automatically search for a proof of the Collatz conjecture. Although trying to prove 
the Collatz conjecture via automated deduction is clearly a moonshot goal, there are 
two recent technological advances that provide reasons for optimism that at least some 
interesting variants of the problem might be solvable. First, the invention of the method 
of matrix interpretations and its variants such as arctic interpretations turns the quest 
of finding a ranking function to witness termination into a problem that is suitable for 
systematic search. Second, the progress in satisfiability (SAT) solving makes it possible 
to solve many seemingly difficult combinatorial problems efficiently in practice. Their 
combination, i.e., using SAT solvers to find interpretations, has so far been effective in 
solving challenging termination problems. We make the following contributions: 

— We show how a generalized Collatz function can be expressed as a rewriting system 
that is terminating if and only if the function is convergent. 

— We show that translations into rewriting systems that use non-unary representations 
of numbers are empirically more amenable to automation compared with their 
previously and more commonly studied counterparts that use unary representations. 

— We automatically prove various weakenings of the Collatz conjecture and observe 
that only relatively large matrix/arctic interpretations exist for some generalized 
Collatz functions. Existing termination tools often limit their default strategies to 
search for small interpretations as they are tailored for the setting where the task is 
to quickly solve a large quantity of relatively easy problems. We make the point that, 
given more resources, the interpretation method has the potential to scale. 

— We observe that the phase-saving heuristic used in modern SAT solvers degrades the 
performance of CDCL solvers on formulas encoding the existence of matrix/arctic 
interpretations, whereas using negative branching improves solver performance. 

— We present adaptations of our rewriting system that allow reformulating several 
more open problems in mathematics as termination problems of small size. 


2 Preliminaries 


2.1 String Rewriting Systems 


Definition 2 (String rewriting system). Let $\Sigma$ be an alphabet, i.e., a set of symbols. A
string rewriting system (SRS) over $\Sigma$ is a relation $R \subseteq \Sigma^* \times \Sigma^*$. Elements $(\ell, r) \in R$
are called rewrite rules and are usually written as $\ell \to r$. The system $R$ induces a rewrite
relation $\to_R := \{(s \ell t, s r t) \mid s, t \in \Sigma^*,\ \ell \to r \in R\}$ on the set $\Sigma^*$ of strings.


Definition 3 (Termination). A relation $\to$ on $A$ is terminating (denoted $\mathrm{SN}(\to)$) if
there is no infinite sequence $s_0, s_1, \dots \in A$ such that $s_i \to s_{i+1}$ for all $i \geq 0$.


We conflate an SRS $R$ with the rewrite relation it induces, writing “$R$ is terminating”
instead of “$\to_R$ is terminating”. The following is a useful generalization of termination:


Definition 4 (Relative termination). For SRSs $R$ and $S$, the system $R$ is said to be
terminating relative to $S$ (denoted $\mathrm{SN}(R/S)$) if every sequence of rewrites for the system
$R \cup S$ applies the rules from $R$ at most finitely many times.

Relative termination allows proofs to be broken into steps as codified by the following.


Lemma 1 (Rule removal [29, Theorem 1]). Let $R$ be an SRS. If there exists a subset
$T \subseteq R$ such that $\mathrm{SN}(T/R)$ and $\mathrm{SN}(R \setminus T)$, then $\mathrm{SN}(R)$.


This lemma allows us to “remove rules” in the following way. When proving $\mathrm{SN}(R)$,
if we succeed at finding a subset $T$ satisfying $\mathrm{SN}(T/R)$, the proof obligation becomes
weakened to $\mathrm{SN}(R \setminus T)$, where the rules of $T$ are no longer present. This removal of rules
can be repeated until no rules remain, thus producing a stepwise proof of termination.
Another useful technique is reversal:


Lemma 2 (Rule reversal [29, Lemma 2]). For a string $s = s_1 \dots s_n \in \Sigma^*$, denote
$s^{\mathrm{rev}} := s_n \dots s_1$ and define the reversal of an SRS $R$ as $R^{\mathrm{rev}} := \{\ell^{\mathrm{rev}} \to r^{\mathrm{rev}} \mid \ell \to
r \in R\}$. For SRSs $R$ and $S$, we have $\mathrm{SN}(R/S)$ if and only if $\mathrm{SN}(R^{\mathrm{rev}}/S^{\mathrm{rev}})$.


Reversal is of interest because methods for proving termination are not necessarily
invariant under reversal, that is, a given technique may fail to show termination of a
system $R$ while succeeding for its reversal $R^{\mathrm{rev}}$.

Yet another important notion is top termination: 


Definition 5 (Top termination). Let $R$ be an SRS over $\Sigma$. The top rewrite relation
induced by $R$ is defined as $\to_{R_{\mathrm{top}}} := \{(\ell s, r s) \mid s \in \Sigma^*,\ \ell \to r \in R\}$. If $\to_{R_{\mathrm{top}}}$ is
terminating, $R$ is said to be top terminating.
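
As a small illustration of the rewrite relation and the top rewrite relation, the following Python sketch enumerates one-step successors of a string. The function names and the encoding of an SRS as a list of string pairs are assumptions made for this example and are not taken from the paper's implementation.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (left-hand side, right-hand side)


def rewrite_steps(s: str, rules: List[Rule]) -> List[str]:
    """All strings reachable from s by one rewrite at any position."""
    successors = []
    for lhs, rhs in rules:
        start = s.find(lhs)
        while start != -1:
            successors.append(s[:start] + rhs + s[start + len(lhs):])
            start = s.find(lhs, start + 1)
    return successors


def top_rewrite_steps(s: str, rules: List[Rule]) -> List[str]:
    """All strings reachable from s by one rewrite at the leftmost end."""
    return [rhs + s[len(lhs):] for lhs, rhs in rules if s.startswith(lhs)]


R = [("aa", "aba")]
print(rewrite_steps("aaa", R))      # ['abaa', 'aaba']
print(top_rewrite_steps("aaa", R))  # ['abaa']
```

The top relation yields a subset of the successors of the full relation, which is why top termination is the weaker requirement.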


In plain language, top termination allows rewrites to be performed only at the leftmost end 
of a string. As we will see in the next section (Theorem 1), top termination problems can 
admit proofs of a more relaxed form compared to termination. Relative top termination, 
i.e., proving $\mathrm{SN}(R_{\mathrm{top}}/S)$ for SRSs $R$ and $S$, is a crucial component in the dependency
pair approach [1] which reduces a termination problem to a relative top termination 
problem that is often easier to solve. In order to avoid requiring familiarity with the 
dependency pair approach, we omit its discussion, and instead prove a self-contained 
result (Lemma 4) that encapsulates dependency pairs in a more elementary manner for 
the specific rewriting systems that we consider in this paper. 


2.2 Interpretation Method 


We state (at a high level) the key results on matrix/arctic interpretations that we use in 
our implementation. For more details we refer the reader to existing work [2,6, 10, 15,26]. 
With the interpretation method, the main idea is to find a ranking function that assigns 
a value to each string such that it decreases strictly when the string is modified by an 
application of a rewrite rule. If for all strings the value is bounded from below, then it 
cannot decrease indefinitely, ruling out the existence of an infinite sequence of rewrites. 
Formally, we search for an instance of the following: 


Definition 6 (Extended/weakly monotone algebra). Let $\Sigma$ be an alphabet, $A$ a set,
$[\sigma] \colon A \to A$ an interpretation for every $\sigma \in \Sigma$, $>$ and $\gtrsim$ order relations over $A$ such
that $>$ is well-founded and $\gtrsim$ satisfies ${>} \cdot {\gtrsim} \subseteq {>}$. Letting $[\cdot]_\Sigma = \{[\sigma] \mid \sigma \in \Sigma\}$,
the structure $(A, [\cdot]_\Sigma, >, \gtrsim)$ is a weakly monotone $\Sigma$-algebra if for every $\sigma \in \Sigma$ the
interpretation $[\sigma]$ is monotone with respect to $\gtrsim$. It is an extended monotone $\Sigma$-algebra
if, additionally, for every $\sigma \in \Sigma$ the interpretation $[\sigma]$ is monotone with respect to $>$.

We extend the interpretation from symbols to strings $s = s_1 \dots s_n \in \Sigma^*$ as $[s] :=
[s_1] \circ \dots \circ [s_n]$. The following general theorem characterizes relative termination (resp.
top termination) as the existence of extended (resp. weakly) monotone algebras.


Theorem 1 ([6, Theorem 2]). Let $R$ and $S$ be SRSs over the alphabet $\Sigma$. We have
$\mathrm{SN}(R/S)$ (resp. $\mathrm{SN}(R_{\mathrm{top}}/S)$) if and only if there exists an extended (resp. weakly)
monotone $\Sigma$-algebra $(A, [\cdot]_\Sigma, >, \gtrsim)$ such that

- for each rule $\ell \to r \in R$ we have $[\ell](x) > [r](x)$ for all $x \in A$,

- for each rule $\ell \to r \in S$ we have $[\ell](x) \gtrsim [r](x)$ for all $x \in A$.


An effective way to prove relative (top) termination is to try to satisfy the conditions
of the above theorem by fixing $(A, >, \gtrsim)$ and algorithmically searching for appropriate
interpretations of symbols. Matrix interpretations is an instance of this method. We fix
a dimension $d$, set $A = \mathbb{N}^d$, define $\vec{x} \gtrsim \vec{y} :\iff x_i \geq y_i$ for all $i \in \{1, \dots, d\}$, and
define $\vec{x} > \vec{y} :\iff \vec{x} \gtrsim \vec{y} \wedge x_1 > y_1$. For interpreting each symbol $\sigma \in \Sigma$, we
consider an affine function $[\sigma](\vec{x}) = M_\sigma \vec{x} + \vec{v}_\sigma$. In this way, the structure $(\mathbb{N}^d, [\cdot]_\Sigma, >, \gtrsim)$
satisfies the requirements of Definition 6 for a weakly monotone algebra. Additionally
setting $(M_\sigma)_{1,1} = 1$ satisfies the requirements for an extended monotone algebra. Matrix
interpretations can also be adapted to the max–plus algebra of arctic numbers $\mathbb{A} :=
\mathbb{N} \cup \{-\infty\}$ as coefficients with different arithmetic operations and order relations [15,26].


Example 1. Let $R = \{aa \to aba\}$ and $S = \{b \to bb\}$. The following functions
constitute a matrix interpretations proof that shows $\mathrm{SN}(R/S)$.


[Display: affine interpretations $[a](\vec{x})$ and $[b](\vec{x})$ over $\mathbb{N}^2$; the matrix and vector entries are not legible in this copy.]


It can be checked that the above interpretations give an extended monotone algebra and
that they satisfy the following for all $\vec{x} \in \mathbb{N}^2$, which implies $\mathrm{SN}(R/S)$ via Theorem 1.


$$[aa](\vec{x}) > [aba](\vec{x}), \qquad [b](\vec{x}) \gtrsim [bb](\vec{x})$$
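
The conditions of Theorem 1 for a matrix interpretation can be checked mechanically. The Python sketch below does so for $R = \{aa \to aba\}$ and $S = \{b \to bb\}$. The matrices in `candidate` are an assumption: they are one interpretation that passes the check, not necessarily the one printed in Example 1. The helper names are likewise hypothetical, and the entrywise comparison is a sufficient test for the inequalities over $\mathbb{N}^d$.

```python
from functools import reduce


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def compose(f, g):
    """Affine map x -> f(g(x)), where a pair (M, v) stands for x -> M x + v."""
    (M, v), (N, w) = f, g
    return mat_mul(M, N), [p + q for p, q in zip(mat_vec(M, w), v)]


def interpret(string, interp):
    """[s_1 ... s_n] = [s_1] o ... o [s_n]."""
    return reduce(compose, (interp[symbol] for symbol in string))


def weakly_decreasing(f, g):
    """f(x) >= g(x) componentwise for all x in N^d (entrywise check)."""
    (M, v), (N, w) = f, g
    return (all(a >= b for rm, rn in zip(M, N) for a, b in zip(rm, rn))
            and all(a >= b for a, b in zip(v, w)))


def strictly_decreasing(f, g):
    """Additionally, the first component decreases strictly."""
    return weakly_decreasing(f, g) and f[1][0] > g[1][0]


def proves_relative_termination(R, S, interp):
    # A top-left entry of at least 1 keeps each map monotone with respect to >;
    # the text's choice of exactly 1 is a special case.
    extended = all(M[0][0] >= 1 for M, _ in interp.values())
    strict = all(strictly_decreasing(interpret(l, interp), interpret(r, interp)) for l, r in R)
    weak = all(weakly_decreasing(interpret(l, interp), interpret(r, interp)) for l, r in S)
    return extended and strict and weak


# Assumed candidate interpretation (not read from the paper's display).
candidate = {
    "a": ([[1, 1], [0, 0]], [0, 1]),
    "b": ([[1, 0], [0, 0]], [0, 0]),
}
print(proves_relative_termination([("aa", "aba")], [("b", "bb")], candidate))  # True
```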


In order to automate the search for the interpretations given a rewriting system R, an 
effective approach is to encode all of the aforementioned constraints as a propositional 
formula in CNF and use a SAT solver to look for a satisfying assignment. This addition- 
ally involves fixing a finite domain for the coefficients that can occur in the interpretations 
and encoding arithmetic over the chosen finite domain using propositional variables. 


2.3 Generalized Collatz Functions 


We consider instances of the following generalization of the Collatz function. Its variants 
have commonly appeared in the literature [3, 12, 14, 16,21,24, 27]. 


Definition 7 (Generalized Collatz function). Let $X$ be one of $\mathbb{N}$, $\mathbb{N}^+$, or $\mathbb{Z}$ and define
$X_\bot = X \cup \{\bot\}$. A function $f \colon X_\bot \to X_\bot$ is a generalized Collatz function if $f(\bot) =
\bot$ and there exist an integer $d \geq 2$ and rational numbers $q_0, \dots, q_{d-1}, r_0, \dots, r_{d-1}$
such that for all $0 \leq i \leq d - 1$ and all $n \in X$, we have


$$f(n) = q_i n + r_i \ \text{ if } n \equiv i \pmod{d} \qquad \text{or} \qquad f(n) = \bot \ \text{ if } n \equiv i \pmod{d}.$$


In the above, we allow the representation of a partially defined function by mapping to
$\bot$ in the undefined cases. We call a partial $f$ convergent if all $f$-trajectories contain $\bot$.

Note that the Collatz function corresponds to a generalized one with $d = 2$, $q_0 = 1/2$,
$r_0 = 0$, $q_1 = 3$, $r_1 = 1$. Although the Collatz function is by far the most widely studied
case, there are several other concrete examples of generalized Collatz functions the 
convergence of which is worth studying due to their connections to open problems in 
number theory and computability theory. We discuss these cases in Section 5. 
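
A generalized Collatz function is fully described by its list of cases, which the following Python sketch turns into a callable. Representing $\bot$ by `None`, the name `make_generalized_collatz`, and the error raised on non-integer results are all assumptions of this illustration.

```python
from fractions import Fraction
from typing import Callable, Optional, Sequence, Tuple

# (q_i, r_i) for a defined case, or None when the case maps to "undefined".
Case = Optional[Tuple[Fraction, Fraction]]


def make_generalized_collatz(cases: Sequence[Case]) -> Callable[[Optional[int]], Optional[int]]:
    """Build f from its d cases; cases[i] applies when n is congruent to i modulo d."""
    d = len(cases)

    def f(n: Optional[int]) -> Optional[int]:
        if n is None:              # f(undefined) = undefined
            return None
        case = cases[n % d]
        if case is None:
            return None
        q, r = case
        value = q * n + r
        if value.denominator != 1:
            raise ValueError("case does not map this input to an integer")
        return int(value)

    return f


# The Collatz function: d = 2, q_0 = 1/2, r_0 = 0, q_1 = 3, r_1 = 1.
collatz = make_generalized_collatz([(Fraction(1, 2), Fraction(0)), (Fraction(3), Fraction(1))])
# A partial example: 3n/2 on even inputs, undefined on odd inputs.
partial = make_generalized_collatz([(Fraction(3, 2), Fraction(0)), None])
print(collatz(7), partial(4), partial(3))  # 22 6 None
```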


3 Rewriting the Collatz Function 


We start with systems that use unary representations and then demonstrate via examples 
that mixed base representations can be more suitable for use with automated methods. 


3.1 Rewriting in Unary 


The following system of Zantema [29] simulates iterated application of the Collatz 
function to a number represented in unary, and terminates upon reaching 1. 


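Example 2. $\mathcal{Z}$ denotes the following SRS, consisting of 5 symbols and 7 rules.


$$\begin{array}{lll}
\mathtt{h11} \to \mathtt{1h} & \mathtt{11h}\diamond \to \mathtt{11s}\diamond & \mathtt{h1}\diamond \to \mathtt{t11}\diamond \\
 & \mathtt{1s} \to \mathtt{s1} & \mathtt{1t} \to \mathtt{t111} \\
 & \diamond\mathtt{s} \to \diamond\mathtt{h} & \diamond\mathtt{t} \to \diamond\mathtt{h}
\end{array}$$


This system can be seen as encoding the execution of a Turing machine with cells that
can be contracted/expanded. The symbols $\mathtt{1}$ and $\diamond$ (blank) form the tape alphabet, while
the symbols $\mathtt{h}$ (half), $\mathtt{s}$ (shift), $\mathtt{t}$ (triple) indicate the head along with the state of the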
machine. Through the following result, the Collatz conjecture can be reformulated as 
termination of string rewriting. 


Theorem 2 ([29]). $\mathcal{Z}$ is terminating if and only if the Collatz conjecture holds.


While the forward direction of the above theorem is easy to see (since $\diamond\mathtt{h1}^{2n}\diamond \to^*_{\mathcal{Z}}
\diamond\mathtt{h1}^{n}\diamond$ for $n > 1$ and $\diamond\mathtt{h1}^{2n+1}\diamond \to^*_{\mathcal{Z}} \diamond\mathtt{h1}^{3n+2}\diamond$ for $n \geq 0$), the backward direction is
far from obvious because not every string corresponds to a valid configuration of the
underlying machine.

As another example, consider the system $\mathcal{W} = \{\mathtt{h11} \to \mathtt{1h},\ \mathtt{1h}\diamond \to \mathtt{1t}\diamond,\ \mathtt{1t} \to
\mathtt{t111},\ \diamond\mathtt{t} \to \diamond\mathtt{h}\}$ (originally due to Zantema⁴). Termination of this system has yet to be
proved via automated methods. Nevertheless, there is a simple reason for its termination:


⁴ https://www.lri.fr/~marche/tpdb/tpdb-2.0/SRS/Zantema/z079.srs

It simulates iterated application of a partial generalized Collatz function $W \colon \mathbb{N}^+ \to \mathbb{N}^+$
defined as follows, which is easily seen to be convergent.


$$W(n) = \begin{cases} 3n/2 & \text{if } n \equiv 0 \pmod{2} \\ \bot & \text{if } n \equiv 1 \pmod{2} \end{cases}$$

If a proof of the Collatz conjecture is to be produced by some automated method 
that relies on rewriting, then that method better be able to prove a statement as simple as 
the convergence of W. With this in mind, we describe an alternative rewriting system 
that simulates the Collatz function and terminates upon reaching 1. We then provide 
examples where the alternative system is more suitable for use with termination tools 
(for instance allowing an automated proof of the convergence of W). 


3.2 Rewriting in Mixed Base 


In the mixed base scheme, the overall idea is as follows. Given a number $n \in \mathbb{N}^+$, we
write a mixed binary–ternary representation for it (noting that this representation is not
unique). With this representation, as long as the least significant digit is binary, the parity 
of the number can be recognized by checking only this digit, as opposed to scanning 
the entire string when working in unary. This allows us to easily determine the correct 
case when applying the Collatz function. If the least significant digit is ternary, then 
the representation is rewritten (while preserving its decimal value) to make this digit 
binary. Afterwards, since computing n/2 corresponds to erasing a trailing binary 0 and 
computing 3n + 1 corresponds to inserting a trailing ternary 1, applying the Collatz 
function takes a single rewrite step. We explain this scheme more formally below. 

A mixed base numeral system is a numeral system where the base changes across 
positions, which we define as follows. Note that unary is not a positional numeral system, 
so we require the bases to be greater than 1. 


Definition 8 (Mixed base representation). Let $B \subseteq \mathbb{N}_{>1}$ be a set of bases and let
$N = (n_1)_{b_1} (n_2)_{b_2} \cdots (n_k)_{b_k}$ be a string where $n_i \in \mathbb{N}$. If we have for each $1 \leq i \leq k$ that
$b_i \in B$ and $0 \leq n_i < b_i$, then $N$ is called a mixed $B$-ary representation.

The string $N$ from above represents the decimal number $N_{10} = \sum_{i=1}^{k} n_i \prod_{j=i+1}^{k} b_j$.
Observing that the addition of leading zeros to a string does not change its decimal value,
we may assume without loss of generality that $n_1 > 0$. Furthermore, $b_1$ does not affect
the decimal value of the string, so we may omit it.
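
The decimal value of a mixed base representation can be computed directly from the sum in the definition, or by folding the digits from the most significant position. The Python sketch below does both; the encoding of a representation as a list of (digit, base) pairs and the function names are assumptions of this example.

```python
from math import prod
from typing import List, Tuple

Digit = Tuple[int, int]  # (n_i, b_i): digit value and the base of its position


def decimal_value(digits: List[Digit]) -> int:
    """Sum of each digit times the product of the bases to its right."""
    return sum(n * prod(b for _, b in digits[i + 1:]) for i, (n, _) in enumerate(digits))


def decimal_value_by_folding(digits: List[Digit]) -> int:
    """Start from n_1 and apply x -> b*x + n for each later position."""
    value = digits[0][0]
    for n, b in digits[1:]:
        value = b * value + n
    return value


# Two mixed binary-ternary representations of 19; the base of the first
# position does not influence the value.
print(decimal_value([(1, 2), (0, 3), (0, 2), (1, 3)]))             # 19
print(decimal_value_by_folding([(1, 2), (0, 3), (0, 3), (1, 2)]))  # 19
```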

Now, define $\beta_b^n(x) = bx + n$. After rearranging, we see that the decimal value of
the $B$-ary string $N = n_1 (n_2)_{b_2} \cdots (n_k)_{b_k}$ may also be written as $N_{10} = (\beta_{b_k}^{n_k} \circ \cdots
\circ \beta_{b_2}^{n_2})(n_1)$. This gives us a string and a function view of the same representation, and
we will switch between them as appropriate. In doing so, we also conflate the symbols
and the corresponding functions, referring to $\beta_b^n$ as $n_b$.

As the last ingredient before describing the rewriting system, we observe that we can
write $(\beta_b^n \circ \beta_c^m)(x) = bcx + bm + n$ equivalently as another composition $(\beta_c^{m'} \circ \beta_b^{n'})(x) =
cbx + cn' + m'$ for some suitable $0 \leq n' < b$ and $0 \leq m' < c$. This allows us to swap
the bases of adjacent positions while preserving the decimal value of the string.
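
The suitable digits for the swapped positions follow from a single division with remainder, since $bm + n = cn' + m'$ with $0 \leq m' < c$. The Python sketch below computes them and, as a usage example, derives the six base-swapping rules for a binary digit followed by a ternary digit. The function names are assumptions of this illustration; the symbol names $\mathtt{f}, \mathtt{t}$ (binary) and $\mathtt{0}, \mathtt{1}, \mathtt{2}$ (ternary) are the ones used for the binary–ternary system in this section.

```python
def swap_bases(inner, outer):
    """Swap the bases of two adjacent positions while preserving the value.

    inner = (m, c) is the more significant digit, outer = (n, b) the less
    significant one, so together they act as x -> b*(c*x + m) + n.
    Returns ((n2, b), (m2, c)) with c*(b*x + n2) + m2 equal to that value.
    """
    (m, c), (n, b) = inner, outer
    n2, m2 = divmod(b * m + n, c)
    return (n2, b), (m2, c)


SYMBOL = {(0, 2): "f", (1, 2): "t", (0, 3): "0", (1, 3): "1", (2, 3): "2"}


def swap_rule(inner, outer):
    new_inner, new_outer = swap_bases(inner, outer)
    return SYMBOL[inner] + SYMBOL[outer], SYMBOL[new_inner] + SYMBOL[new_outer]


rules = [swap_rule((m, 2), (n, 3)) for m in range(2) for n in range(3)]
for lhs, rhs in rules:
    print(lhs, "->", rhs)
```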


From this point on, we constrain ourselves to the mixed $\{2, 3\}$-ary (binary–ternary)
representations as we shift our focus to simulating the Collatz function (noting that it 
is possible to adapt the rewriting system that we will end up with to other instances of 
the general case). More precisely, we simulate the following redefinition of the Collatz 
function where the odd case incorporates an additional division by 2. 


$$T(n) = \begin{cases} \dfrac{n}{2} & \text{if } n \equiv 0 \pmod{2} \\[4pt] \dfrac{3n+1}{2} & \text{if } n \equiv 1 \pmod{2} \end{cases}$$


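We will describe an SRS $\mathcal{T}$ over the symbols $\{\mathtt{f}, \mathtt{t}, \mathtt{0}, \mathtt{1}, \mathtt{2}, \triangleleft, \triangleright\}$ that simulates
iterated application of the Collatz function and terminates upon reaching 1. The symbols
$\mathtt{f}, \mathtt{t}$ correspond to binary digits $0_2, 1_2$; and $\mathtt{0}, \mathtt{1}, \mathtt{2}$ to ternary digits $0_3, 1_3, 2_3$. The
symbol $\triangleleft$ marks the beginning of a string while also standing for the most significant
digit (without loss of generality assumed to be 1) and $\triangleright$ marks the end of a string.
Consider the functional view of these symbols:


$$\begin{array}{lll}
\mathtt{f}(x) = 2x & \mathtt{0}(x) = 3x & \triangleleft(x) = 1 \\
\mathtt{t}(x) = 2x + 1 & \mathtt{1}(x) = 3x + 1 & \triangleright(x) = x \\
 & \mathtt{2}(x) = 3x + 2 &
\end{array} \tag{1}$$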


Each positive natural number can be expressed as some composition of these functions, 
which corresponds to a string as per our previous discussion. 


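Example 3. Allowing the inclusion of a redundant trailing symbol $\triangleright$ to mixed base
representations, we can write $19 = (\triangleleft\mathtt{0f1}\triangleright)_{10} = \triangleright(\mathtt{1}(\mathtt{f}(\mathtt{0}(\triangleleft(x)))))$. The string representation
ends with a ternary symbol, so we will rewrite it. With the function view,
we have $\mathtt{1}(\mathtt{f}(x)) = 3(2x) + 1 = 6x + 1 = 2(3x) + 1 = \mathtt{t}(\mathtt{0}(x))$. This shows
that we could also write $19 = (\triangleleft\mathtt{00t}\triangleright)_{10}$, which now ends with the binary digit $1_2$.
This gives us the rewrite rule $\mathtt{f1} \to \mathtt{0t}$. We can now apply the Collatz function to
this representation by rewriting only the rightmost two symbols of the string since
$T(\triangleright(\mathtt{t}(x))) = \frac{3(2x+1)+1}{2} = \frac{6x+4}{2} = 3x + 2 = \triangleright(\mathtt{2}(x))$. This gives us the rewrite
rule $\mathtt{t}\triangleright \to \mathtt{2}\triangleright$. After applying this rule, we indeed obtain $T(19) = 29 = (\triangleleft\mathtt{002}\triangleright)_{10}$.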


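In the manner of the above example, we compute all the necessary transformations
and obtain the following 11-rule SRS $\mathcal{T}$.


$$\mathcal{D}_T = \left\{\begin{array}{l} \mathtt{f}\triangleright \to \triangleright \\ \mathtt{t}\triangleright \to \mathtt{2}\triangleright \end{array}\right\}
\quad
\mathcal{A} = \left\{\begin{array}{ll} \mathtt{f0} \to \mathtt{0f} & \mathtt{t0} \to \mathtt{1t} \\ \mathtt{f1} \to \mathtt{0t} & \mathtt{t1} \to \mathtt{2f} \\ \mathtt{f2} \to \mathtt{1f} & \mathtt{t2} \to \mathtt{2t} \end{array}\right\}
\quad
\mathcal{B} = \left\{\begin{array}{l} \triangleleft\mathtt{0} \to \triangleleft\mathtt{t} \\ \triangleleft\mathtt{1} \to \triangleleft\mathtt{ff} \\ \triangleleft\mathtt{2} \to \triangleleft\mathtt{ft} \end{array}\right\}$$


This SRS is split into subsystems $\mathcal{D}_T$ (dynamic rules for $T$) and $\mathcal{X} = \mathcal{A} \cup \mathcal{B}$ (auxiliary
rules). The two rules in $\mathcal{D}_T$ encode the application of the Collatz function $T$, while
the rules in $\mathcal{X}$ serve to push binary symbols towards the rightmost end of the string by
swapping the bases of adjacent positions without changing the represented value.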

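Example 4 (Rewrite sequence of $\mathcal{T}$). Consider the string $s = \triangleleft\mathtt{ff0}\triangleright$ that represents
the number 12. Below is a possible rewrite sequence of $\mathcal{T}$ that starts from $s$, with the
corresponding decimal values (under the interpretations from (1)) displayed above the
strings. Underlines indicate the parts of the strings where the rules are applied.


$$\begin{aligned}
&\overset{12}{\triangleleft\mathtt{f}\underline{\mathtt{f0}}\triangleright}
\to_{\mathcal{A}} \overset{12}{\triangleleft\mathtt{f0}\underline{\mathtt{f}\triangleright}}
\to_{\mathcal{D}_T} \overset{6}{\triangleleft\underline{\mathtt{f0}}\triangleright}
\to_{\mathcal{A}} \overset{6}{\triangleleft\mathtt{0}\underline{\mathtt{f}\triangleright}}
\to_{\mathcal{D}_T} \overset{3}{\underline{\triangleleft\mathtt{0}}\triangleright}
\to_{\mathcal{B}} \overset{3}{\triangleleft\underline{\mathtt{t}\triangleright}}
\to_{\mathcal{D}_T} \overset{5}{\underline{\triangleleft\mathtt{2}}\triangleright} \\
&\to_{\mathcal{B}} \overset{5}{\triangleleft\mathtt{f}\underline{\mathtt{t}\triangleright}}
\to_{\mathcal{D}_T} \overset{8}{\triangleleft\underline{\mathtt{f2}}\triangleright}
\to_{\mathcal{A}} \overset{8}{\underline{\triangleleft\mathtt{1}}\mathtt{f}\triangleright}
\to_{\mathcal{B}} \overset{8}{\triangleleft\mathtt{ff}\underline{\mathtt{f}\triangleright}}
\to_{\mathcal{D}_T} \overset{4}{\triangleleft\mathtt{f}\underline{\mathtt{f}\triangleright}}
\to_{\mathcal{D}_T} \overset{2}{\triangleleft\underline{\mathtt{f}\triangleright}}
\to_{\mathcal{D}_T} \overset{1}{\triangleleft\triangleright}
\end{aligned}$$


The trajectory of $T$ continues upon reaching 1; however, in order to be able to formulate
the Collatz conjecture as a termination problem, $\mathcal{T}$ is made in such a way that its rewrite
sequences stop upon reaching the string representation $\triangleleft\triangleright$ of 1 since no rule is applicable.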
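
A rewrite sequence of this kind can be reproduced with a few lines of Python. The sketch below applies, at each step, the first rule in a fixed list that matches anywhere in the string. This rule-selection strategy, the ASCII markers `<` and `>`, and the function names are assumptions of the illustration; other strategies give different strings but the same sequence of decimal values.

```python
RULES = [("f>", ">"), ("t>", "2>"),                                   # dynamic rules
         ("f0", "0f"), ("f1", "0t"), ("f2", "1f"),
         ("t0", "1t"), ("t1", "2f"), ("t2", "2t"),                    # swap rules
         ("<0", "<t"), ("<1", "<ff"), ("<2", "<ft")]                  # leading-digit rules
DIGIT = {"f": (2, 0), "t": (2, 1), "0": (3, 0), "1": (3, 1), "2": (3, 2)}  # (base, digit)


def value(s):
    """Decimal value of a string of the form '<' digits '>'."""
    if not (s.startswith("<") and s.endswith(">")):
        raise ValueError("expected a string delimited by '<' and '>'")
    n = 1  # the begin marker stands for the leading digit 1
    for ch in s[1:-1]:
        base, digit = DIGIT[ch]
        n = base * n + digit
    return n


def rewrite_once(s):
    """Apply the first applicable rule at its leftmost match, or return None."""
    for lhs, rhs in RULES:
        i = s.find(lhs)
        if i != -1:
            return s[:i] + rhs + s[i + len(lhs):]
    return None


def run(s, max_steps=10_000):
    trace = [s]
    for _ in range(max_steps):
        nxt = rewrite_once(trace[-1])
        if nxt is None:
            break
        trace.append(nxt)
    return trace


for string in run("<ff0>"):
    print(value(string), string)
```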

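Termination of the subsystems of $\mathcal{T}$ with $\mathcal{B}$ or $\mathcal{D}_T$ removed is easily seen. However,
since we have matrix interpretations at our disposal, let us give a compact proof.


Lemma 3. $\mathrm{SN}(\mathcal{T} \setminus \mathcal{B})$ and $\mathrm{SN}(\mathcal{T} \setminus \mathcal{D}_T)$.


Proof. It is easily checked that the interpretations below show $\mathrm{SN}((\mathcal{T} \setminus \mathcal{B})^{\mathrm{rev}})$, which
implies $\mathrm{SN}(\mathcal{T} \setminus \mathcal{B})$ by Lemma 2.


$$[\mathtt{f}](x) = [\mathtt{t}](x) = 2x + 1 \qquad [\triangleright](x) = x \qquad [\mathtt{0}](x) = [\mathtt{1}](x) = [\mathtt{2}](x) = 2x$$


The interpretations below show $\mathrm{SN}((\mathcal{T} \setminus \mathcal{D}_T)^{\mathrm{rev}})$, which implies $\mathrm{SN}(\mathcal{T} \setminus \mathcal{D}_T)$ by Lemma 2.

$$[\triangleleft](x) = [\mathtt{f}](x) = [\mathtt{t}](x) = x + 1 \qquad [\mathtt{0}](x) = [\mathtt{1}](x) = [\mathtt{2}](x) = 4x$$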
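
One-dimensional interpretations of this kind are easy to verify by comparing coefficients. The Python sketch below does so for the two reversed subsystems, under the following assumptions: the markers are written as `<` and `>`, and the interpretation values are read from the proof as $2x+1$ for $\mathtt{f}, \mathtt{t}$, $x$ for the end marker and $2x$ for the ternary digits in the first case, and $x+1$ for the begin marker and $\mathtt{f}, \mathtt{t}$ and $4x$ for the ternary digits in the second case. The helper names are hypothetical.

```python
D_T = [("f>", ">"), ("t>", "2>")]
A = [("f0", "0f"), ("f1", "0t"), ("f2", "1f"), ("t0", "1t"), ("t1", "2f"), ("t2", "2t")]
B = [("<0", "<t"), ("<1", "<ff"), ("<2", "<ft")]


def reverse(rules):
    return [(lhs[::-1], rhs[::-1]) for lhs, rhs in rules]


def interpret(string, interp):
    """[s_1 ... s_n] = [s_1] o ... o [s_n], each map given as (a, b) for x -> a*x + b."""
    a, b = 1, 0
    for symbol in string:
        p, q = interp[symbol]
        a, b = a * p, a * q + b  # compose the current map with the next symbol's map
    return a, b


def strictly_decreasing(f, g):
    """a1*x + b1 > a2*x + b2 for every natural number x."""
    return f[0] >= g[0] and f[1] > g[1]


def proves_termination(rules, interp):
    monotone = all(a >= 1 for a, _ in interp.values())
    return monotone and all(
        strictly_decreasing(interpret(l, interp), interpret(r, interp)) for l, r in rules)


first = {"f": (2, 1), "t": (2, 1), ">": (1, 0), "0": (2, 0), "1": (2, 0), "2": (2, 0)}
second = {"<": (1, 1), "f": (1, 1), "t": (1, 1), "0": (4, 0), "1": (4, 0), "2": (4, 0)}

print(proves_termination(reverse(D_T + A), first))  # True
print(proves_termination(reverse(A + B), second))   # True
print(proves_termination(D_T + A, first))           # False: reversal matters here
```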


As a whole, the system $\mathcal{T}$ simulates the iterated application of $T$ (except at 1).


Theorem 3. $\mathcal{T}$ is terminating if and only if $T$ is convergent.


Proof (sketch). We observe that the rules of $\mathcal{T}$ do not change the number of occurrences
of $\triangleleft$ or $\triangleright$ in a string and that the rewrite sequences operate strictly on one side of these
symbols. Thus, we may view a given string as split into blocks delimited by $\triangleleft$ or $\triangleright$
and consider the termination of each block separately. In this way, we conclude that
there exists a nonterminating rewrite sequence for a string if and only if it contains a
block of the canonical form $\triangleleft(\mathtt{f}|\mathtt{t}|\mathtt{0}|\mathtt{1}|\mathtt{2})^*\triangleright$ that can be rewritten indefinitely, since the
rewrite sequences that start on blocks of all other forms are already seen to terminate by
Lemma 3. Furthermore, under the interpretations in (1), the sequences of values attained
by the rewrites of the blocks in canonical form correspond directly to Collatz trajectories,
since the rules in $\mathcal{X}$ do not change the value of the block and the rules in $\mathcal{D}_T$ change the
value of the block in exactly the same way as the Collatz function $T$.


When trying to remove a rule in $\mathcal{D}_T$ or $\mathcal{B}$ it suffices to show relative top termination,
allowing us to use weakly (instead of extended) monotone algebras when applying
Theorem 1 and take advantage of the more relaxed constraints when searching for
matrix/arctic interpretations. The lemma below encapsulates dependency pairs, and it
can in fact be automatically proved via the dependency pair framework [9].


Lemma 4. For each subset $R \subseteq \mathcal{B}$, if $\mathrm{SN}(R_{\mathrm{top}}/\mathcal{T})$ then $\mathrm{SN}(R/\mathcal{T})$. And, for each
subset $R \subseteq \mathcal{D}_T$, if $\mathrm{SN}((R^{\mathrm{rev}})_{\mathrm{top}}/\mathcal{T}^{\mathrm{rev}})$ then $\mathrm{SN}(R^{\mathrm{rev}}/\mathcal{T}^{\mathrm{rev}})$.


Proof (sketch). Without loss of generality, assume we start with a string of the canonical
form $\triangleleft(\mathtt{f}|\mathtt{t}|\mathtt{0}|\mathtt{1}|\mathtt{2})^*\triangleright$ (resp. its reversal). Then, the rules in $\mathcal{B}$ (resp. $\mathcal{D}_T^{\mathrm{rev}}$) can only
be applied at the top level. As we know from Lemma 3 that $\mathcal{T} \setminus \mathcal{B}$ (resp. $\mathcal{T} \setminus \mathcal{D}_T$) is
terminating, any infinite sequence of rewrites in $\mathcal{T}$ (resp. its reversal) would require
infinitely many applications of the rules from $\mathcal{B}$ (resp. $\mathcal{D}_T^{\mathrm{rev}}$). As these rules can only
be applied at the top level, this would imply relative top nontermination.


4 Automated Proofs 


We adapt the rewriting system $\mathcal{T}$ to different generalized Collatz functions to explore the
effectiveness of the mixed base scheme on weakened variants of the Collatz conjecture. 
The rewriting systems, scripts to reproduce the experiments, and our implementation of 
a termination prover are available at https://github.com/emreyolcu/rewriting-collatz. 

Most top-tier termination tools, such as AProVE, Matchbox, and TTT2, use the SAT
solver MiniSat [5] to search for matrix/arctic interpretations. This choice is somewhat 
surprising as MiniSat has not been updated since 2008 and the performance of SAT 
solvers has improved significantly in the last decade. The use of MiniSat in these provers 
is motivated by its observed effectiveness in finding interpretations. We investigated the 
reason for this, which turned out to be a heuristic that MiniSat disables in its default 
configuration. MiniSat uses negative branching [5], which explores the “false” branch 
first for all decision variables. Modern SAT solvers use phase-saving [22] which first 
explores the branch corresponding to the truth value to which the variable was forced to 
most recently during unit propagation. In our case, enabling negative branching improves 
solver performance for formulas that encode the existence of interpretations. 


4.1 Convergence of W 


With the mixed binary–ternary scheme, the function $W$ from Section 3.1 can be seen
to be simulated by the system $\mathcal{W}' = \{\mathtt{f}\triangleright \to \mathtt{0}\triangleright\} \cup \mathcal{X}$. A small matrix interpretations
proof is found for this system in less than a second, in contrast to its variant $\mathcal{W}$ that uses
unary representations for which no automated proof is known.


Theorem 4. $\mathrm{SN}(\mathcal{W}')$.

Proof. The interpretations below prove $\mathrm{SN}(\{\triangleright\mathtt{f} \to \triangleright\mathtt{0}\}/\mathcal{X}^{\mathrm{rev}})$:


[Display: matrix interpretations of the symbols; the matrix and vector entries are not legible in this copy.]


By Lemmas 3 and 2, $\mathcal{X}^{\mathrm{rev}}$ is terminating. As a result, $\mathcal{W}'^{\,\mathrm{rev}}$ is terminating, which
by Lemma 2 implies that $\mathcal{W}'$ is terminating.


4.2 Farkas’ Variant


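Let $2\mathbb{N} + 1 = \{1, 3, 5, \dots\}$ denote the odd natural numbers. Farkas [8] studied a slight
modification $F' \colon 2\mathbb{N} + 1 \to 2\mathbb{N} + 1$ of the Collatz function which can be proved
convergent via induction. We consider automatically proving the convergence of this
function as another test case for the mixed base scheme that is easier than the Collatz
conjecture without being entirely trivial. We refer the reader to [8] for the original
definition of $F'$. Below, we define another function $F \colon \mathbb{N} \to \mathbb{N}$ that resembles the
Collatz function more closely than Farkas’ $F'$ (with respect to the definitions of the
cases) while being equivalent to $F'$ in terms of convergence. This variant is obtained by
introducing an additional case in the Collatz function for $n \equiv 1 \pmod{3}$ and applying
$T$ otherwise. Its definition and a set $\mathcal{D}_F$ of dynamic rules are shown below.


$$F(n) = \begin{cases}
\dfrac{n-1}{3} & \text{if } n \equiv 1 \pmod{3} \\[4pt]
\dfrac{n}{2} & \text{if } n \equiv 0 \text{ or } n \equiv 2 \pmod{6} \\[4pt]
\dfrac{3n+1}{2} & \text{if } n \equiv 3 \text{ or } n \equiv 5 \pmod{6}
\end{cases}
\qquad
\mathcal{D}_F = \left\{\begin{array}{l}
\mathtt{1}\triangleright \to \triangleright \\
\mathtt{0f}\triangleright \to \mathtt{0}\triangleright \\
\mathtt{1f}\triangleright \to \mathtt{1}\triangleright \\
\mathtt{1t}\triangleright \to \mathtt{12}\triangleright \\
\mathtt{2t}\triangleright \to \mathtt{22}\triangleright
\end{array}\right\}$$


Termination of the rewriting system $\mathcal{F} = \mathcal{D}_F \cup \mathcal{X}$ is equivalent to the convergence of $F$.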
The proof of the equivalence is essentially the same as that of Theorem 3. Farkas gave 
an inductive proof of convergence for F’ via case analysis, and we found an automated 
proof that F is terminating via arctic interpretations. It is worth mentioning that the 
default configurations of the existing termination tools (e.g., AProVE, Matchbox) are 
too conservative to prove termination of this system, but after their authors tweaked the 
strategies they were also able to find automated proofs via arctic interpretations. 
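
The correspondence between the cases of $F$ and the dynamic rules can be spot-checked numerically. In the Python sketch below, the five rules are a reading of the display above in which the end marker is written as `>`; the rule for the suffix `1t>` is inferred from the definition of $F$ rather than read directly, so it should be treated as an assumption. The bounded search in `reaches_one` is only a sanity check over a finite range and is no substitute for the termination proof.

```python
SYMBOLS = {"f": (2, 0), "t": (2, 1), "0": (3, 0), "1": (3, 1), "2": (3, 2), ">": (1, 0)}

# Assumed reading of the dynamic rules for F (see the note above).
D_F = [("1>", ">"), ("0f>", "0>"), ("1f>", "1>"), ("1t>", "12>"), ("2t>", "22>")]


def F(n):
    if n % 3 == 1:
        return (n - 1) // 3
    if n % 2 == 0:              # n is congruent to 0 or 2 modulo 6
        return n // 2
    return (3 * n + 1) // 2     # n is congruent to 3 or 5 modulo 6


def as_function(string):
    """Coefficients (a, b) of x -> a*x + b obtained by applying symbols left to right."""
    a, b = 1, 0
    for symbol in string:
        p, q = SYMBOLS[symbol]
        a, b = p * a, p * b + q
    return a, b


def rule_applies_F(lhs, rhs, samples=range(50)):
    (a1, b1), (a2, b2) = as_function(lhs), as_function(rhs)
    return all(F(a1 * x + b1) == a2 * x + b2 for x in samples)


def reaches_one(n, max_steps=10_000):
    for _ in range(max_steps):
        if n == 1:
            return True
        n = F(n)
    return False


print(all(rule_applies_F(l, r) for l, r in D_F))       # True
print(all(reaches_one(n) for n in range(1, 2001)))
```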


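Theorem 5. For all $n \in \mathbb{N}^+$, the trajectory $F^{*}(n)$ contains 1.


Proof. We will show $\mathrm{SN}(\mathcal{F})$. By Lemmas 3 and 2, we have $\mathrm{SN}(\mathcal{X}^{\mathrm{rev}})$. The arctic
interpretations below (with the empty cells standing for $-\infty$) prove $\mathrm{SN}((\mathcal{D}_F^{\mathrm{rev}})_{\mathrm{top}}/\mathcal{X}^{\mathrm{rev}})$ by
Theorem 1, which implies $\mathrm{SN}(\mathcal{D}_F^{\mathrm{rev}}/\mathcal{X}^{\mathrm{rev}})$ by Lemma 4. As we know $\mathcal{X}^{\mathrm{rev}}$ is terminating,
by Lemma 1 we conclude $\mathrm{SN}(\mathcal{D}_F^{\mathrm{rev}} \cup \mathcal{X}^{\mathrm{rev}})$, implying $\mathrm{SN}(\mathcal{F})$ via Lemma 2.


[Display: arctic matrix interpretations $[\mathtt{f}]$, $[\mathtt{t}]$, $[\triangleleft]$, $[\triangleright]$, $[\mathtt{0}]$, $[\mathtt{1}]$, $[\mathtt{2}]$; the entries are not legible in this copy.]


4.3 Subsets of $\mathcal{T}$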


It is also interesting to consider whether we can automatically prove terminations of
proper subsets of $\mathcal{T}$. Specifically, we considered the 11 subsystems obtained by leaving
out a single rewriting rule from $\mathcal{T}$, and we found proofs via matrix/arctic interpretations
for all of the 11 subproblems. The reason for our interest in these problems is threefold:


1. Termination of $\mathcal{T}$ implies the terminations of all of its subsystems, so proving its
termination is at least as difficult a task as proving terminations of the 11 subsystems.
Therefore, the subproblems serve as additional sanity checks that an automated
approach aspiring to succeed for the Collatz conjecture ought to be able to pass.

2. When proving termination in a stepwise manner, we solve a sequence of relative
termination problems. Having proved the terminations of all 11 subsystems is a
partial solution to the full problem, since it implies that for any single rule $\ell \to r \in
\mathcal{T}$, proving $\mathrm{SN}(\{\ell \to r\}/\mathcal{T})$ settles the Collatz conjecture.

3. After the removal of a rule, the termination of the remaining system still encodes
a valid mathematical question about the Collatz trajectories. The question of termination
of a proper subset is equivalent to asking if every corresponding Collatz
trajectory that does not require the use of the left-out rule is convergent.


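Example 5. As an instance of leaving out a rule, consider the subsystem $\mathcal{T} \setminus \{\mathtt{f1} \to \mathtt{0t}\}$.
There is a single-step matrix interpretations proof that this system is terminating:


[Display: matrix interpretations of the symbols over $\mathbb{N}^2$; the matrix and vector entries are not legible in this copy.]


With the above interpretations, we can show for instance that the Collatz trajectory
starting at 3 (represented as $\triangleleft\mathtt{t}\triangleright$) is convergent, because the missing rule is not used in
any derivation of 1 ($\triangleleft\triangleright$) from 3. Below is an example derivation along with the decimal
values each string represents and a vector value of each string under the interpretations
above (setting $\vec{x} = (0, 0)$ for the purpose of demonstration). We omit the subscripts from
the rewrite relations and simply write $\to$.


$$\begin{array}{ccccccccccccccccc}
3 & & 5 & & 5 & & 8 & & 8 & & 8 & & 4 & & 2 & & 1 \\
\triangleleft\mathtt{t}\triangleright & \to & \triangleleft\mathtt{2}\triangleright & \to & \triangleleft\mathtt{ft}\triangleright & \to & \triangleleft\mathtt{f2}\triangleright & \to & \triangleleft\mathtt{1f}\triangleright & \to & \triangleleft\mathtt{fff}\triangleright & \to & \triangleleft\mathtt{ff}\triangleright & \to & \triangleleft\mathtt{f}\triangleright & \to & \triangleleft\triangleright \\
\begin{pmatrix} 79 \\ 0 \end{pmatrix} & > & \begin{pmatrix} 78 \\ 0 \end{pmatrix} & > & \begin{pmatrix} 68 \\ 0 \end{pmatrix} & > & \begin{pmatrix} 62 \\ 0 \end{pmatrix} & > & \begin{pmatrix} 41 \\ 0 \end{pmatrix} & > & \begin{pmatrix} 40 \\ 0 \end{pmatrix} & > & \begin{pmatrix} 26 \\ 0 \end{pmatrix} & > & \begin{pmatrix} 14 \\ 0 \end{pmatrix} & > & \begin{pmatrix} 12 \\ 0 \end{pmatrix}
\end{array}$$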

Table 1 shows the parameters for the proofs that we found for the termination of each
subsystem. For each rule $\ell \to r$ that is left out, we searched for a stepwise proof to show
that $\mathcal{B} \setminus \{\ell \to r\}$ is terminating relative to $\mathcal{T} \setminus \{\ell \to r\}$ (freely utilizing weakly monotone
algebras due to Lemma 4). Such a proof requires at most three steps since there are at
most three rules in $\mathcal{B} \setminus \{\ell \to r\}$. On the table, we report the smallest parameters (in
terms of matrix dimension) that work for all of these steps. As we already know that
$\mathrm{SN}(\mathcal{T} \setminus \mathcal{B})$ holds (by Lemma 3), the interpretations found allow us to conclude the
termination of each subsystem. This is not the only way to prove the terminations of the
subsystems; however, we chose this uniform strategy for the sake of comparison.


Table 1. Smallest proofs found for terminations of subsystems of $\mathcal{T}$ in under 120 seconds. The
columns show the matrix dimension $d$ and the maximum number $v$ of distinct coefficients that
appear in the matrices, along with the median time to find an entire termination proof across 10
repetitions for the fixed $d$ and $v$.


                                                              Matrix              Arctic
Rule removed                                                  d    v    Time      d    v    Time
$\mathtt{f}\triangleright \to \triangleright$                           3    4    4s        3    5    19s
$\mathtt{t}\triangleright \to \mathtt{2}\triangleright$                 1    2    <1s       1    3    <1s
$\triangleleft\mathtt{0} \to \triangleleft\mathtt{t}$                   2    2    <1s       2    3    <1s
$\triangleleft\mathtt{1} \to \triangleleft\mathtt{ff}$                  3    3    1s        3    4    1s
$\triangleleft\mathtt{2} \to \triangleleft\mathtt{ft}$                  4    4    8s        4    3    4s
$\mathtt{f0} \to \mathtt{0f}$                                           4    2    1s        3    4    3s
$\mathtt{f1} \to \mathtt{0t}$                                           1    3    1s        1    4    1s
$\mathtt{f2} \to \mathtt{1f}$                                           [row not legible in this copy]
$\mathtt{t0} \to \mathtt{1t}$                                           4    3    2s        3    4    1s
$\mathtt{t1} \to \mathtt{2f}$                                           5    2    1s        4    3    1s
$\mathtt{t2} \to \mathtt{2t}$                                           4    4    2s        2    5    1s


4.4 Odd Trajectories 


In the originally defined Collatz function $C$, applying $2n + 1 \mapsto 6n + 4$ produces
an even number, so we incorporate a single division by 2 into the definition of the
odd case and obtain the function $T$ with the same overall dynamics as $C$. Taking
this idea further by performing as many divisions by 2 as possible leads to the so-called
Syracuse function $\mathrm{Syr} \colon 2\mathbb{N} + 1 \to 2\mathbb{N} + 1$, defined as $\mathrm{Syr}(n) = \frac{3n+1}{2^k}$ where
$k = \max\{k \in \mathbb{N}^+ \mid 2^k \text{ divides } 3n + 1\}$.

Expressing the Syracuse function as a generalized Collatz function would require
infinitely many cases to account for all of the possible appearances of $2^k$ as the denominator
with different values of $k$. As a result, we are unable to simulate it with a finite
rewriting system. Nevertheless, we may compromise and accelerate the Collatz function
by a constant amount. We first observe that if $n \equiv 1 \pmod{8}$ then $\mathrm{Syr}(n) = \frac{3n+1}{4}$
and if $n \equiv 3 \pmod{4}$ then $\mathrm{Syr}(n) = \frac{3n+1}{2}$. Furthermore, for any $n \in \mathbb{N}$ we have
$\mathrm{Syr}(8n + 5) = \mathrm{Syr}(2n + 1)$ since $3(8n + 5) + 1 = 24n + 16 = 4(6n + 4) =
4(3(2n + 1) + 1)$. Putting these observations together, we can define a generalized
Collatz function $S \colon 2\mathbb{N} + 1 \to 2\mathbb{N} + 1$ as follows.


$$S(n) = \begin{cases}
\dfrac{3n+1}{4} & \text{if } n \equiv 1 \pmod{8} \\[4pt]
\dfrac{n-1}{4} & \text{if } n \equiv 5 \pmod{8} \\[4pt]
\dfrac{3n+1}{2} & \text{if } n \equiv 3 \pmod{4}
\end{cases}$$
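
The three observations behind $S$ can be confirmed numerically on odd inputs. The following Python sketch is a minimal check; the function names `syr` and `S` mirror the text, while the tested range is an arbitrary choice for this illustration.

```python
def syr(n):
    """Syracuse function on odd n: divide 3n + 1 by 2 as often as possible."""
    m = 3 * n + 1
    while m % 2 == 0:
        m //= 2
    return m


def S(n):
    """The accelerated map S on odd n, following the three cases in the text."""
    if n % 8 == 1:
        return (3 * n + 1) // 4
    if n % 8 == 5:
        return (n - 1) // 4
    return (3 * n + 1) // 2  # remaining odd residues: n congruent to 3 modulo 4


for n in range(1, 2001, 2):
    if n % 8 == 5:
        # S only strips the trailing pattern; the Syracuse value is unchanged.
        assert syr(S(n)) == syr(n)
    else:
        assert S(n) == syr(n)
    assert S(n) % 2 == 1  # S maps odd numbers to odd numbers

print(S(9), S(13), S(7))  # 7 3 11
```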


S is convergent if and only if C (or T) is convergent, and the number of steps that 
S takes to converge is between that of T and Syr. In a manner similar to before, we 
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<< (6) 


Sy & 


Fig. 1. Transition graphs of the iterates in the Collatz trajectories across residue classes modulo 8 
for the functions C (left), T (middle), S (right). For each function f, the edge u — v is part of its 
transition graph if and only if there exists some n = u (mod 8) such that f(n) =v (mod 8). 
Bold edges indicate transitions where f(n) > n. 


can translate $S$ into a rewriting system $\mathcal{S} = \{\mathsf{ffe} \to \mathsf{0e},\ \mathsf{tfe} \to \mathsf{e},\ \mathsf{te} \to \mathsf{2e}\} \cup \mathcal{X}$.
Since we are working with odd numbers we used a new symbol $\mathsf{e}$ to mark the end of
a string, viewed functionally as $\mathsf{e}(x) = 2x + 1$. Termination of the rewriting system
$\mathcal{S}$ is equivalent to the convergence of $S$. Similar to $\mathcal{T}$, proving the termination of $\mathcal{S}$ is
currently beyond our reach, although it may potentially be an easier path to the Collatz
conjecture (compared to proving $\mathrm{SN}(\mathcal{T})$). Failing to prove the termination of $\mathcal{S}$ itself,
we considered the subsystems of $\mathcal{S}$ as we did for $\mathcal{T}$ in Section 4.3. With matrix/arctic
interpretations, the termination of all but two of the 11-rule subsystems of $\mathcal{S}$ was
automatically proved. Despite devoting thousands of CPU hours, we were not able to
find interpretations to prove that $\mathcal{S}_1 = \mathcal{S} \setminus \{\mathsf{ffe} \to \mathsf{0e}\}$ or $\mathcal{S}_2 = \mathcal{S} \setminus \{\mathsf{tfe} \to \mathsf{e}\}$ is
terminating, so we leave them as challenges for automated termination proving.
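The three rules that are specific to this system can be sanity-checked against the function S by reading every symbol as an affine map. The sketch below is illustrative only. It assumes the digit conventions f(x) = 2x, t(x) = 2x + 1 and 0, 1, 2 as x ↦ 3x, 3x + 1, 3x + 2, with the leftmost symbol applied first; only e(x) = 2x + 1 is stated in this section, the rest is an assumption about the earlier sections.

```python
# Each symbol is an affine map x -> a*x + b, stored as the pair (a, b).
SYMBOLS = {
    "f": (2, 0),  # assumed: binary digit 0
    "t": (2, 1),  # assumed: binary digit 1
    "0": (3, 0),  # assumed: ternary digit 0
    "1": (3, 1),  # assumed: ternary digit 1
    "2": (3, 2),  # assumed: ternary digit 2
    "e": (2, 1),  # end marker for odd numbers, e(x) = 2x + 1
}

RULES = [("ffe", "0e"), ("tfe", "e"), ("te", "2e")]


def evaluate(word: str, x: int) -> int:
    """Apply the symbols of `word` to x, leftmost symbol first (assumed order)."""
    for symbol in word:
        a, b = SYMBOLS[symbol]
        x = a * x + b
    return x


def S(n: int) -> int:
    """The accelerated Collatz map on odd numbers."""
    if n % 8 == 1:
        return (3 * n + 1) // 4
    if n % 8 == 5:
        return (n - 1) // 4
    return (3 * n + 1) // 2


def rule_agrees_with_S(lhs: str, rhs: str, samples=range(100)) -> bool:
    """Check that rewriting lhs to rhs corresponds to one application of S."""
    return all(S(evaluate(lhs, x)) == evaluate(rhs, x) for x in samples)


results = {lhs: rule_agrees_with_S(lhs, rhs) for lhs, rhs in RULES}
```

Under these assumptions each left-hand side covers one residue class of the case split: ffe yields 8x + 1, tfe yields 8x + 5, and te yields 4x + 3.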


4.5 Collatz Trajectories Modulo 8 


Let m be a power of 2. Given $k \in \{0, 1, \dots, m-1\}$, is it the case that all nonconvergent
Collatz trajectories contain some $n \equiv k \pmod m$? For several values of k this can be
proved to hold by inspecting the transitions of the iterates in the Collatz trajectories
across residue classes modulo m (shown in Figure 1 for m = 8). These questions can
also be formulated as the termination of some rewriting systems. With this approach we
found automated proofs for several cases:


Theorem 6. If there exists a nonconvergent Collatz trajectory, it cannot avoid the
residue classes of 2, 3, 4, 6 modulo 8.


It remains open whether the above holds for the residue classes of 0, 1, 5, 7 modulo 8. 
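The transition graphs of Figure 1 can be recomputed by brute force. The sketch below is an illustration under assumptions: T is taken to be the usual accelerated Collatz map (n/2 for even n, (3n + 1)/2 for odd n), and a finite sample of inputs is used, which suffices here because the residue of the image modulo 8 depends only on n modulo 32 for both maps.

```python
def T(n: int) -> int:
    """Assumed definition of the accelerated Collatz map T."""
    return n // 2 if n % 2 == 0 else (3 * n + 1) // 2


def S(n: int) -> int:
    """The map S on odd numbers."""
    if n % 8 == 1:
        return (3 * n + 1) // 4
    if n % 8 == 5:
        return (n - 1) // 4
    return (3 * n + 1) // 2


def transition_graph(f, domain, modulus: int = 8) -> dict:
    """Map each residue u to the sorted residues v with f(n) = v for some n = u."""
    edges = {}
    for n in domain:
        edges.setdefault(n % modulus, set()).add(f(n) % modulus)
    return {u: sorted(vs) for u, vs in sorted(edges.items())}


graph_T = transition_graph(T, range(1, 257))
graph_S = transition_graph(S, range(1, 257, 2))  # S is defined on odd n only
```

Note that the graph alone does not settle the question of this section. For instance, the graph of T has a self-loop at residue 0, although no trajectory can stay in that class forever.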


5 More Problems to Approach via Rewriting 


Mahler’s 3/2 Problem. Let $\xi \in \mathbb{R}_{>0}$ be a real number. It is called a Z-number if for
all $k \in \mathbb{N}$ we have $\mathrm{frac}\left(\xi \left(\tfrac{3}{2}\right)^k\right) < \tfrac{1}{2}$, where $\mathrm{frac}(\cdot)$ denotes the fractional part of the
number. Mahler [20] conjectured that there are no Z-numbers. Moreover, he considered
a generalized Collatz function $M\colon \mathbb{N}^+ \to \mathbb{N}^+$, defined as follows.


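\[
M(n) = \begin{cases}
\frac{3n}{2} & \text{if } n \equiv 0 \pmod 2 \\
\frac{3n+1}{2} & \text{if } n \equiv 1 \pmod 4 \\
1 & \text{if } n \equiv 3 \pmod 4
\end{cases}
\]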


He related the behaviors of M-trajectories to the existence of Z-numbers: 


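Theorem 7. For $n \in \mathbb{N}^+$, if a Z-number exists in the interval $[n, n+1)$, then there is
no $k \in \mathbb{N}$ for which $M^k(n) \equiv 3 \pmod 4$.

Thus, the nonexistence of Z-numbers can be established by proving that M is convergent,
which is equivalent to the termination of $\mathcal{M} = \{\mathsf{f}\triangleright \to \mathsf{0}\triangleright,\ \mathsf{ft}\triangleright \to \mathsf{10}\triangleright\} \cup \mathcal{X}$. In order
to ensure termination at the case $n \equiv 3 \pmod 4$, there is no rule with the LHS $\mathsf{tt}\triangleright$.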
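Theorem 7 can be explored experimentally by iterating M until an iterate falls into the residue class 3 modulo 4. This is a small sketch with hypothetical function names; it stops at the first such iterate, so it does not depend on the value that M assigns in that case.

```python
def mahler_step(n: int) -> int:
    """One application of M for n not congruent to 3 modulo 4."""
    if n % 2 == 0:
        return 3 * n // 2
    if n % 4 == 1:
        return (3 * n + 1) // 2
    raise ValueError("n is congruent to 3 modulo 4; the iteration stops here")


def trajectory_until_3_mod_4(n: int, max_steps: int = 10_000) -> list:
    """Iterates of M starting at n, up to the first one congruent to 3 mod 4."""
    trajectory = [n]
    while trajectory[-1] % 4 != 3 and len(trajectory) <= max_steps:
        trajectory.append(mahler_step(trajectory[-1]))
    return trajectory


def rules_out_z_number(n: int, max_steps: int = 10_000) -> bool:
    """True if the search shows, via Theorem 7, that [n, n+1) has no Z-number."""
    return trajectory_until_3_mod_4(n, max_steps)[-1] % 4 == 3
```

For example, starting from 4 the iterates are 4, 6, 9, 14, 21, 32, 48, 72, 108, 162, 243, and 243 is congruent to 3 modulo 4, so by Theorem 7 there is no Z-number in [4, 5).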


Halting Problem for Busy Beaver-5. The busy beaver problem concerns finding binary-
alphabet Turing machines with n states that, when given an input tape of all 0s, write
the largest number of 1s on the tape upon halting. For each n, the machine that achieves
this is called the “Busy Beaver-n”. Note that this definition only requires the machines
to halt on all-0 inputs, leaving the behavior on other inputs unspecified and allowing
them not to halt in general. Michel [21] observed that for $n \in \{2, 3, 4\}$, the busy beaver
machines are all total Turing machines, i.e., they halt on all inputs, and moreover proved 
that they all simulate some generalized Collatz function. It is an open problem whether 
all busy beavers are total. In particular, it is unknown whether the current Busy Beaver- 
5 candidate is total. Michel showed that the Busy Beaver-5 candidate simulates the 
following generalized Collatz function. 


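\[
B(n) = \begin{cases}
\frac{5n+18}{3} & \text{if } n \equiv 0 \pmod 3 \\
\frac{5n+22}{3} & \text{if } n \equiv 1 \pmod 3 \\
\bot & \text{if } n \equiv 2 \pmod 3
\end{cases}
\]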


Convergence of the above function can be studied via the termination of a rewriting 
system obtained by a mixed {3,5 }-ary (ternary—quinary) translation scheme. We were 
unable to prove the termination of the resulting system. 
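The function B is easy to iterate directly. The sketch below is illustrative only; it represents the undefined case by `None`, and the starting value 0 is chosen here as an example rather than taken from the text.

```python
from typing import Optional


def B(n: int) -> Optional[int]:
    """One step of B; None stands for the undefined (halting) case n = 2 mod 3."""
    if n % 3 == 0:
        return (5 * n + 18) // 3
    if n % 3 == 1:
        return (5 * n + 22) // 3
    return None


def trajectory(n: int, max_steps: int = 10_000) -> list:
    """Iterates of B from n until the undefined case is reached (or we give up)."""
    values = [n]
    for _ in range(max_steps):
        nxt = B(values[-1])
        if nxt is None:
            break
        values.append(nxt)
    return values


from_zero = trajectory(0)
```

Starting from 0, the iteration reaches 12284 after 14 steps, and 12284 is congruent to 2 modulo 3, so it stops there. The open question is whether this happens from every starting value.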


Ternary Expansions of $2^n$. Erdős [7] asked: When does the ternary expansion of $2^n$
omit the digit 2? This is the case for $2^0 = (1)_3$, $2^2 = (11)_3$, and $2^8 = (100111)_3$. He
conjectured that it does not happen for $n > 8$. This conjecture can be proved by showing
that the rewriting system $\mathcal{E} = \{\mathsf{0}\triangleright \to \triangleright,\ \mathsf{1}\triangleright \to \triangleright,\ \triangleleft\triangleright \to \triangleleft\triangleright\} \cup \{r \to \ell \mid \ell \to r \in \mathcal{X}\}$ is
terminating on all initial strings of the form $\triangleleft\mathsf{f}\mathsf{f}^*\triangleright$. Given a string that corresponds to
the binary representation of a power of 2, this system essentially rewrites the string into
ternary by pushing ternary symbols to the right without altering the value that the string
represents, and removes the occurrences of the ternary digits 0 and 1 (but not 2). If the
ternary expansion does not contain the digit 2 then all digits will be removed, resulting in
the string $\triangleleft\triangleright$ that can then be rewritten to itself indefinitely. This problem, as described,
is an instance of “local termination” [28] since it is concerned with termination on not 
all possible strings but a subset of them. We have not performed experiments with this 
system or local termination yet and we leave this for future work. 
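Independently of the rewriting formulation, the conjecture itself is simple to test for small exponents. The following sketch uses names chosen here for illustration and checks only a bounded range, so it is evidence rather than proof.

```python
def ternary_digits(m: int) -> list:
    """Ternary digits of a nonnegative integer, least significant first."""
    digits = []
    while m > 0:
        digits.append(m % 3)
        m //= 3
    return digits or [0]


def omits_digit_two(n: int) -> bool:
    """True if the ternary expansion of 2**n contains no digit 2."""
    return 2 not in ternary_digits(2 ** n)


# Exponents up to 300 whose power of two avoids the ternary digit 2.
exceptions = [n for n in range(301) if omits_digit_two(n)]
```

In this range the only exponents found are 0, 2 and 8, in agreement with the three cases listed above.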
6 Related Work


To our knowledge, Zantema [29], with his system Z that we saw in Section 3.1, was the 
first to attempt using an automated method and string rewriting to search for a proof of 
the Collatz conjecture. In addition, although we independently discovered the mixed 
binary—ternary system described in Section 3.2, Scollo [25] had essentially the same 
idea, the difference being that he adopted a functional view of the digits that is slightly 
different than in (1). Scollo was not concerned with proving termination, though, and 
proposed rewriting primarily as a formalism that forgoes the arithmetic interpretation of 
the iterates and instead emphasizes its dynamic/computational behavior. 

De Mol [4] showed the existence of a small 2-tag system [23] with the following rules
that simulates the iterated application of the Collatz function given a unary representation:
$\{1 \to {<}{>},\ {<} \to 1,\ {>} \to 111\}$. This tag system halts if and only if the Collatz conjecture
holds, giving yet another formulation of the problem.
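A 2-tag system reads the first symbol of the word, appends the production of that symbol, and deletes the first two symbols; it halts when fewer than two symbols remain. The sketch below simulates the three rules above on a word of n ones and records the lengths of the intermediate words that consist of ones only. The function names and the step bound are choices made here for illustration.

```python
PRODUCTIONS = {"1": "<>", "<": "1", ">": "111"}


def tag_step(word: str) -> str:
    """One step of the 2-tag system: append the production, drop two symbols."""
    return word[2:] + PRODUCTIONS[word[0]]


def unary_values(n: int, max_steps: int = 1_000_000) -> list:
    """Lengths of the all-ones words visited when starting from n ones."""
    word = "1" * n
    values = [n]
    for _ in range(max_steps):
        if len(word) < 2:  # halting condition of a 2-tag system
            break
        word = tag_step(word)
        if set(word) == {"1"}:
            values.append(len(word))
    return values
```

Starting from three ones, the all-ones words have lengths 3, 5, 8, 4, 2, 1, which is the trajectory of 3 under the map that sends even n to n/2 and odd n to (3n + 1)/2.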

Kari [11] designed 1D cellular automata that perform multiplication by 3 and 3/2 in 
base 6, and reformulated both the Collatz conjecture and Mahler’s 3/2 problem as sets 
of constraints to be satisfied by the space-time diagrams of these cellular automata. 

Kauffman [13] developed a formalism to perform arithmetic that he called string 
arithmetic, and expressed the Collatz conjecture within it. This formalism works with 
unary representations of numbers, and uses the three symbols 1, <, >. Letting $\epsilon$ denote
the empty string and N be any string representing a number, string arithmetic consists of
the following bidirectional rewrite rules (or “identities”) to convert between different
strings representing the same number: $\{{>}{<} \leftrightarrow \epsilon,\ 11 \leftrightarrow {<}1{>},\ 1N \leftrightarrow N1\}$. Then, the
Collatz function is encoded by the following two rules: $\{{<}N{>} \to N,\ {<}N{>}1 \to {<}N1{>}N\}$.
The Collatz conjecture is equivalent to the question of whether for strings of 1s of all
lengths there exists a rewrite sequence using the five rules above to reach the string 1.


7 Future Work 


Several extensions to this work can further our understanding of the potential of rewriting 
techniques for answering mathematical questions. For instance, although matrix/arctic 
interpretations lead to automated proofs of several weakened variants discussed in 
this paper, it might still be the case that there exists no matrix/arctic interpretation to 
establish the termination of the Collatz system $\mathcal{T}$. Proving nonexistence would provide
guidance as to where to focus our efforts when searching for a proof. Another issue 
is the matter of representation, specifically, it is worth exploring whether there exists 
a suitable translation of the Collatz conjecture into a term, instead of string, rewriting 
system since many automated termination proving techniques are generalized to term 
rewriting. Finally, injecting problem-specific knowledge into the rewriting systems or 
the termination techniques would be helpful as there exists a wealth of information about 
the Collatz conjecture that could simplify proof search. 
Abstract. Symbolic computation is involved in many areas of mathematics,
as well as in analysis of physical systems in science and engineering.
Computer algebra systems present an easy-to-use interface
for performing these calculations, but do not provide strong guarantees 
of correctness. In contrast, interactive theorem proving provides much 
stronger guarantees of correctness, but requires more time and exper- 
tise. In this paper, we propose a general framework for combining these 
two methods, and demonstrate it using computation of definite integrals. 
It allows the user to carry out step-by-step computations in a familiar 
user interface, while also verifying the computation by translating it to 
proofs in higher-order logic. The system consists of an intermediate lan- 
guage for recording computations, proof automation for simplification 
and inequality checking, and heuristic integration methods. A prototype 
is implemented in Python based on HolPy, and tested on a large collec- 
tion of examples at the undergraduate level. 


Keywords: Symbolic integration, User interface, Proof automation 


1 Introduction 


Symbolic computation is an important tool in mathematics, science, and engi- 
neering. It forms a key part of many mathematical proofs. On the engineering 
side, justifications for the design of signal processing and control systems con- 
tain extensive symbolic computations [633], involving derivatives and integrals, 
Laplace and Fourier transforms, and various special functions. 

Typically, these computations can be performed using computer algebra sys- 
tems such as Mathematica, Maple, and Maxima. Given the complexity of the 
task, it is not surprising that even the best of these systems are liable to errors. 
One famous example is $\int_{-1}^{1} \sqrt{x^2}\,dx$, which an early version of Maple evaluates to
zero [23] (the error has been fixed in the more recent versions). Bugs in Mathematica
have also been observed by mathematicians [15], including evaluation
of determinants of matrices with large integer entries, and several evaluations 
of integrals (also fixed in the most recent version). While some errors are sim- 
ply implementation mistakes, more systematic errors in symbolic computation 
may arise due to neglect of checking side conditions, involving concepts such
as well-definedness of expressions, singularities, convergence, and so on. While 
individual bugs can be reported and fixed, completely eliminating the possibility 
of error would require a more systematic approach. 


Formalization of mathematics in interactive theorem provers promises to 
eventually achieve this goal. There is already a lot of work on formalization 
of analysis and linear algebra in interactive theorem provers, as well as veri- 
fied computations based on the formalized theories. They provide much stronger 
guarantees of correctness, and also allow users to specify more detailed steps, en- 
abling computations that are too difficult to be found automatically by computer 
algebra systems. However, a major disadvantage (for now) is that interactive the- 
orem proving requires a great deal of time and expertise on the part of the user, 
making it difficult to apply on a much larger scale. 


It is therefore natural to try to combine the advantages of computer algebra 
systems with theorem proving. There have already been many works in this 
direction. A common approach, proposed by Harrison and Théry [2023], is to 
invoke a computer algebra system for computations that are difficult to perform, 
but whose results can be verified more easily. This greatly extends the capability 
of proof assistants for tasks such as factorization [23], linear arithmetic [28], 
etc. However, to use such a system, the user still needs expertise in the use of 
proof assistants, and the range of applicability is limited by the simple proof 
automation that is available for checking results. 


In this paper, we propose a more general framework for verified symbolic com- 
putation in theorem provers, and demonstrate it using computation of definite 
integrals. The resulting system allows users to perform calculations of definite 
integrals step-by-step, in a user interface similar to that of a computer algebra 
system, but with the computations verified by automatic translation to proofs 
in higher-order logic. We choose definite integration for demonstration purposes, 
due to the great variety of techniques that can be used, but we intend the idea 
to be applicable to other kinds of symbolic computations. 


The framework consists of several components. At the top, a graphical user 
interface displays the current computation and allows user actions. The user 
interface produces computations in a standard format. Next, proof automation is 
used to reconstruct from the computation a proof in higher-order logic. Finally, 
the proof depends on theorems in mathematics, e.g. (in the case of definite 
integration) those concerning continuity, derivatives, and integrals. 


We implement a prototype based on HolPy, a new interactive theorem prover 
written in Python [49]. The SymPy package for symbolic computation in Python 
is used at various places for untrusted computations. The user interface is written 
in JavaScript as a web application, using Python as backend for convenient 
invocation of HolPy and SymPy libraries. The underlying theorems in analysis 
are mostly translated to HolPy from HOL Light (with some modifications). Their 
proofs have not been fully formalized in HolPy, hence the statements of these 
theorems still need to be trusted. 
We now give an outline for the rest of this paper. Section 2 presents the
overall framework. Section 3 describes the intermediate format for recording
computations of definite integrals. In Sections 4 and 5 we describe respectively
the user interface and the proof reconstruction process. In Section 6 we present
an evaluation of the system, along with some interesting examples. Finally, we
conclude in Section 7 with discussion of possible future work.


Related work. There is a huge body of work on formal verification of continuous 
and hybrid systems, based on reachability checking [4], computation of invari- 
ants [36]41], deductive methods [84]35/47], and so on. In particular, KeYmaera 
X provides a user interface for verifying hybrid systems using differential dy- 
namic logic, with automatic generation of proofs checkable in Isabelle [9]. Most 
of this work focuses on automatic verification and/or logical formalisms. Our 
work can be seen as complementary, focusing on verifying symbolic reasoning 
about mathematical concepts such as special functions and integration, which 
can also form a part of the justification of control systems. 

Harrison and Théry proposed the “skeptical” approach for combining theo- 
rem provers with computer algebra systems [2023]. Some common applications 
include factorization of polynomials, which is further applied to verify antideriva- 
tives involving sine and cosine [23]. More recently, this technique is used by 
Chyzak et al. to formalize the proof of irrationality of $\zeta(3)$ [4], and by Harrison
to verify proofs of hypergeometric sums found using the WZ method [22]. Similar 
approaches are implemented in Isabelle [8], PVS [3] and Lean [28]. Compared to 
this work, we present more complex proof automation for reconstructing proofs, 
as well as a user interface for allowing users to perform multi-step computations 
in a more familiar setting. Other user interfaces for proof assistants with support 
for displaying mathematical computations include Theorema and jsCoq [5]. 

The theory of integration has been formalized in every major proof assis- 
tant [12]24]37/40/43). Recently, more advanced concepts that are important in 
science and engineering have been formalized, including the work by Hasan et 
al. on Fourier and Laplace transforms [37[38]46], and Immler et al. on ordinary 
differential equations [2526]. Work has also been done on formalizing advanced 
concepts in linear algebra [29], with applications in analyzing mechanical sys- 
tems [13]44]. Of course, formalized symbolic computation can be applied in many 
other domains. For example, Selsam et al. [42] verified in Lean the correctness 
of stochastic backpropagation, an important algorithm in deep learning. 

Slagle initiated the study of automatic integration with a heuristic method 
[45]. Later research focused more on methods that are complete for certain types 
of integrands, such as Risch’s algorithm [19]. More recently, Rubi (rule-based
integration) has been demonstrated to be a powerful technique [39]. However, none
of this work focuses on formal verification. A verified computation of asymptotics
for real-valued functions is implemented by Eberl [16]. Verified numerical
computation of definite integrals is implemented by Mahboubi et al. [80]. 


Acknowledgements. This work was partially supported by the National Natural 
Science Foundation of China under Grant Nos. 62002351, 62032024, and the 


488 R. Xu, L. Li, et al. 


Chinese Academy of Sciences Pioneer 100 Talents Program under Grant No. 
Y9RC585036. 


2 Overall Architecture 


In this section, we describe the overall architecture of the system, leaving descrip- 
tions of its components to the following sections. We focus on definite integrals 
of continuous functions in one variable over closed intervals. In particular, we 
consider expressions given by the following syntax: 


e ::= v | c | e1 op e2 | f(e) | Deriv(e, v) | Integral(e, v, a, b)


Here v is a variable; c is a constant (either a rational number or π); op is an
arithmetic operation (+, −, ×, ÷ and exponentiation); f is a special function
(such as logarithms, exponentials, or trigonometric functions); Deriv(e, v) denotes
the derivative of e with respect to variable v; Integral(e, v, a, b) denotes the
definite integral of e with respect to variable v over the interval [a, b]. In the rest
of this paper, we will use both concrete syntax and LaTeX form of expressions.
We use locations to point to particular subexpressions. A location is given by
a sequence of natural numbers (written in the form $n_1.n_2.\dots.n_k$, with each $n_i$
starting from zero), specifying the path to a subtree in the abstract syntax tree
of an expression. For example, in the expression


1 + Integral(1 + sin^2(x), x, 0, 1)


the location of sin^2(x) is given by 1.0.1.
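Location lookup amounts to following child indices down the syntax tree. The sketch below uses a hypothetical encoding of expressions as nested tuples `(head, child0, child1, ...)`; this encoding, and the way the square of the sine is represented, are assumptions made for illustration and not HolPy's data structures.

```python
# Hypothetical encoding of 1 + Integral(1 + sin^2(x), x, 0, 1):
# a node is a tuple (head, child0, child1, ...); leaves are plain values.
EXPR = (
    "+",
    1,
    ("Integral", ("+", 1, ("^", ("sin", "x"), 2)), "x", 0, 1),
)


def subexpression(expr, location: str):
    """Follow a dotted location such as '1.0.1' from the root of expr."""
    if location == "":
        return expr  # the empty location denotes the whole expression
    for index in (int(part) for part in location.split(".")):
        expr = expr[index + 1]  # position 0 of each tuple holds the head
    return expr


sin_squared = subexpression(EXPR, "1.0.1")
```

Here index 1 selects the integral, index 0 its integrand, and index 1 the second summand of the integrand.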

A computation is represented as a list of steps, with each step specifying a 
rewriting of the current expression. Each step should provide sufficient informa- 
tion so that both checking its correctness and proof generation can be performed 
relatively easily. A computation begins with the integral to be evaluated, and 
ends with an expression in simplified closed form. Each step contains the name 
of the rule used, the location in the expression at which it is applied, and the 
expected result of applying the step. A step may contain additional parameters 
and certificates needed for verification. Rules of integration include substitution, 
integration by parts, use of a trigonometric identity, and so on (described in 
detail in Section 3). For example, integration by parts takes as parameters two
expressions u and v, such that f · dx = u · dv where f is the integrand of the
integral at the given location.

A graphical user interface allows the user to specify a computation in ways 
similar to using a computer algebra system. The user interface displays the 
computation in LaTeX or in text form. At each step, the user selects part of the
current expression to focus on, then selects an action from the menu. Depending 
on the selected action, the user may need to enter some of the parameters, while 
the other parameters are automatically inferred by the system. After checking 
validity of inputs, the user interface computes the result of the action. A package 
for symbolic computation may be invoked at this step. 
There are many side conditions that need to hold in order for a computation
step to be correct, some of which may not be caught at the user interface.
Translation of the computation to proofs in higher-order logic greatly increases 
our confidence in the computation and can point out potential errors. In this 
work, we translate the computation to higher-order logic proofs in HolPy. One 
main difficulty is implementing sufficiently powerful proof automation for sim- 
plification of expressions, inequality checking, and other side conditions. We 
demonstrate that the API for proof automation in HolPy is sufficiently powerful 
for this purpose. However, note that the representation of a computation is independent
of any particular proof assistant, so additional proof translation may be
implemented for other proof assistants.

Finally, various algorithms for integration (such as Slagle’s method [45]) may 
be implemented to perform several steps of computation at once. We imple- 
mented Slagle’s method and have it as one of the options at the user interface. 

The overall framework is shown in the following diagram. 


[Figure: layered architecture diagram. Top layer: User interface, Slagle’s method, Other algorithms. Below: Computation. Below: Proof automation in HolPy, with Isabelle and Coq as further targets. Bottom layer: Analysis library.]

Here solid boxes and arrows indicate parts that are implemented for this paper. 
The analysis library is only partially formalized. Dotted arrows indicate possible 
future extensions. 

This layered design can be viewed as a separation of concerns. At the top, the 
user only needs to think about how to evaluate an integral in general mathematical
terms. The implementation of integration algorithms only involves computer
algebra. Proof automation involves algorithms for constructing proofs in the un- 
derlying logic. Finally, building a library in analysis involves working with a 
proof assistant. All these are put together to enable verification of potentially 
difficult symbolic integration by producing proofs in higher-order logic or other 
logical formalisms. In the following three sections, we describe the top three 
layers of the system in more detail. 


3 Integration Rules 


Rules of integration define the language for recording computations. Each rule 
may take additional parameters (as described below), as well as a location pa- 
rameter specifying the subexpression the rule is applied on. 
3.1 Simplification


The rule Simplification rewrites an expression to an equivalent simpler form. 
The details of simplification depend on the implementation. Here we only specify
in broad terms what is and is not simplified. These choices are made mainly
considering the ease of performing simplifications, and having a clearly defined
“simplified form”. We do expand products of polynomials and combine terms
(e.g. from $(x+1)(x-1)$ to $x^2 - 1$). We do not reduce quotients of polynomials
(e.g. from $(x^3+1)/(x^2+1)$ to $x - (x-1)/(x^2+1)$, and from $2/(x^2-1)$ to
$1/(x-1) - 1/(x+1)$). We do not automatically expand powers (e.g. a power of $x+1$).
We do simplify values of trigonometric functions (e.g. from $\sin(\pi/4)$ to $\sqrt{2}/2$, and
from $\sin(\pi/2 - x)$ to $\cos x$), but do not use other trigonometric identities. We do
evaluate derivatives and apply a fixed list of basic integrals, including linearity,
powers, sine, cosine, exponential, and derivatives of trigonometric functions.

One complication is that certain rewrite rules contain side conditions. For 
example, it is only possible to simplify $\sqrt{xy}$ to $\sqrt{x} \cdot \sqrt{y}$ when both x and y are
nonnegative. Likewise $(x^2)^{1/2}$ can be simplified to $x^{2 \cdot \frac{1}{2}} = x$ only if x is nonnegative
(otherwise the mistake mentioned in the introduction would result). When
simplifying an integrand of an integral in x, we assume that x is within the open 
domain of integration, and perform simplification only if it is allowed by this 
assumption. 


3.2 Trigonometric Identities 


Application of trigonometric identities can be very tricky. It is often necessary 
to use trigonometric identities to rewrite an expression to a more complex form, 
in order to prepare for a substitution or integration by parts. 

We use the classification of trigonometric identities by Fu et al. [17], which is
implemented in SymPy (sympy.simplify.fu). In this scheme, trigonometric identi- 
ties are classified into several groups with names of the form TRi. Some com- 
monly used groups are shown below (rewriting from left to right): 

– TR5: $\sin^2 x = 1 - \cos^2 x$.

– TR6: $\cos^2 x = 1 - \sin^2 x$.

– TR7: $\cos^2 x = \frac{1}{2}(1 + \cos 2x)$.

– TR9: $\sin x + \sin y = 2 \sin\left(\frac{x+y}{2}\right) \cos\left(\frac{x-y}{2}\right)$, etc.

– TR11: $\sin 2x = 2 \sin x \cos x$, $\cos 2x = \cos^2 x - \sin^2 x$, etc.


The Rewrite trigonometric rule rewrites using one group of trigonometric 
identities, followed by simplification. It takes a parameter rule which specifies 
the name of the rule used. For example, applying with rule = TR5 on $2 - 2\sin^2 x$
yields $2\cos^2 x$.


3.3 Substitution 


Substitution makes use of the following theorem known from first-year calculus: 


\[
\int_a^b f(g(x))\, g'(x)\, dx = \int_{g(a)}^{g(b)} f(u)\, du.
\]
There are two possible directions for applying the theorem, corresponding to two
rules Substitution I and Substitution II. 


Forward substitution. The rule Substitution I assumes the integrand is of
the form f(g(x))g'(x). Typically in informal writing, only g(x) is provided, and
f(x) is found by a sometimes magical process. To see the possible complexity
involved, consider the integral

\[
\int_{3/4}^{1} \frac{1}{\sqrt{1-x} - 1}\, dx.
\]

The required substitution is $u = \sqrt{1-x}$. The usual explanation continues as
follows. Compute $du = -\frac{1}{2}(1-x)^{-1/2}\, dx = -\frac{1}{2}u^{-1}\, dx$. So $dx = -2u\, du$. The
values of u at the boundary points are $\frac{1}{2}$ and 0. So the integral can be rewritten
as $\int_{1/2}^{0} -2u/(u-1)\, du = \int_{0}^{1/2} 2u/(u-1)\, du$.

Heuristic methods are needed for finding a suitable function f. Hence, we
require the Substitution I rule to specify both f and g as parameters. The
rule checks that f(g(x))g'(x) and the original integrand become the same after
simplification. We also restrict g to be monotonic (equivalently g'(x) > 0 or
g'(x) < 0 in the open interval (a, b)).¹ For example, the previous substitution is
given by $f(u) = -2u/(u-1)$ and $g(x) = \sqrt{1-x}$.
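The check performed by the rule, and the resulting equality of integrals, can be illustrated numerically. This is a sketch under assumptions: the integration limits 3/4 and 1 are read off from the boundary values of u, the sign of f is chosen so that f(g(x))g'(x) reproduces the integrand, and a simple midpoint rule stands in for exact integration.

```python
import math


def midpoint(func, a: float, b: float, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of func from a to b."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


def integrand(x: float) -> float:
    return 1.0 / (math.sqrt(1.0 - x) - 1.0)


def g(x: float) -> float:
    return math.sqrt(1.0 - x)


def g_prime(x: float) -> float:
    return -0.5 / math.sqrt(1.0 - x)


def f(u: float) -> float:
    return -2.0 * u / (u - 1.0)


# Side check of the rule: f(g(x)) * g'(x) agrees with the original integrand.
samples = [0.76, 0.8, 0.9, 0.99]
pointwise_ok = all(
    math.isclose(f(g(x)) * g_prime(x), integrand(x), rel_tol=1e-12)
    for x in samples
)

original = midpoint(integrand, 0.75, 1.0)
substituted = midpoint(f, g(0.75), g(1.0))  # limits g(a) = 1/2 and g(b) = 0
exact = 1.0 - 2.0 * math.log(2.0)  # closed form of the substituted integral
```

Both approximations agree with the closed form 1 − 2 ln 2 to several digits.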


Backward substitution. The rule Substitution II applies substitution in
the other direction. In informal writing, it is usually expressed as substituting
x by some expression g(t). Then f is the original integrand, but the values of a
and b need to be found by the reader. Our rule requires specifying a and b so
that g(a) and g(b) equal the original limits of integration, and g is monotonic
in the range (a, b). For example, the step

\[
\int_0^1 \sqrt{1 - x^2}\, dx = \int_0^{\pi/2} \sqrt{1 - \sin^2 t}\, \cos t\, dt
\]

is represented as g = sin(t), a = 0 and b = π/2.


3.4 Integration by Parts 


The Integration by parts rule applies the theorem

\[
\int_a^b u(x)\, v'(x)\, dx = u(x)\, v(x) \Big|_a^b - \int_a^b u'(x)\, v(x)\, dx.
\]

Typically in informal writing, both u and v are provided. These are recorded
as parameters of the rule. The rule checks that f · dx = u · dv, where f is the
original integrand. For example, the step

\[
\int_{-1}^{2} x e^x\, dx = x e^x \Big|_{-1}^{2} - \int_{-1}^{2} e^x\, dx
\]

is represented as u = x and $v = e^x$.


1 It is possible to relax this assumption, but the process for reconstructing the proof 
would be more involved. 
3.5 Rewriting


The Rewrite rule provides more flexibility for rewriting than simplification. 
It allows rewriting an expression to any equivalent form as the preparation for 
applying other rules. The rule takes a parameter rhs specifying the intended 
right side of the rewrite, and another expression denom, defaulting to 1. The 
rule checks that denom is nonzero in the domain of integration, and the original 
expression and rhs have the same simplification after multiplying by denom. 

The presence of denom means polynomial division and partial fraction decomposition
can be specified. For example, when integrating $x^3/(x^2+1)$, the
first step is to divide the numerator by the denominator, yielding $x - x/(x^2+1)$.
Simplification as we have implemented is not strong enough to show their
equivalence. However, after multiplying both sides by denom = $x^2 + 1$, the expressions
$x^3$ and $x(x^2+1) - x$ become the same after simplification.
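The denom check for this example only needs polynomial arithmetic. The sketch below represents polynomials as coefficient lists with the lowest degree first; it illustrates the idea and is not the simplifier used in the system.

```python
def poly_mul(p: list, q: list) -> list:
    """Product of two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_sub(p: list, q: list) -> list:
    """Difference p - q of two coefficient lists."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a - b for a, b in zip(p, q)]


def normalize(p: list) -> list:
    """Drop trailing zero coefficients so that equal polynomials compare equal."""
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p


x = [0, 1]          # the polynomial x
denom = [1, 0, 1]   # x^2 + 1

# Original expression x^3 / (x^2 + 1), multiplied by denom: x^3.
lhs_times_denom = [0, 0, 0, 1]
# Right side x - x / (x^2 + 1), multiplied by denom: x * (x^2 + 1) - x.
rhs_times_denom = poly_sub(poly_mul(x, denom), x)

rewrite_ok = normalize(lhs_times_denom) == normalize(rhs_times_denom)
```

The rule additionally requires denom to be nonzero on the domain of integration, which holds here because x² + 1 is positive everywhere.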


3.6 Splitting an Integral 


Sometimes it is necessary to split the domain of integration into two or more
parts. This is needed to deal with absolute values, and non-monotonic functions g
in a substitution. The rule Split region takes a parameter c satisfying a < c < b,
and splits the integral $\int_a^b f(x)\, dx$ into $\int_a^c f(x)\, dx + \int_c^b f(x)\, dx$. For example, when
integrating $\int_{-1}^{1} \sqrt{x^2}\, dx$ (the example from the introduction), the first step is
to split with c = 0, resulting in $\int_{-1}^{0} \sqrt{x^2}\, dx + \int_{0}^{1} \sqrt{x^2}\, dx$, which can then be
simplified to $\int_{-1}^{0} -x\, dx + \int_{0}^{1} x\, dx$.
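A quick numerical comparison shows why the split matters. The sketch below assumes the limits −1 and 1 for the example and uses a midpoint rule; it contrasts the correct value with the result of the unjustified simplification of the integrand to x on the whole interval.

```python
import math


def midpoint(func, a: float, b: float, n: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of func from a to b."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


# Direct numerical value of the original integral.
direct = midpoint(lambda x: math.sqrt(x * x), -1.0, 1.0)

# After Split region with c = 0, each part is simplified using the sign of x.
split = midpoint(lambda x: -x, -1.0, 0.0) + midpoint(lambda x: x, 0.0, 1.0)

# Simplifying sqrt(x^2) to x without the side condition x >= 0.
unsound = midpoint(lambda x: x, -1.0, 1.0)
```

The first two values are 1 up to rounding, while the unsound simplification gives 0, the erroneous result mentioned in the introduction.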

3.7 Solving Equations 


One particularly interesting technique for integration involves solving for the
value of the integral in an equation.² If an integral I can be written in the form
X − cI, where X is any expression (containing no or simpler integrals), and c is
a constant not equal to −1, then we can solve the equation I = X − cI to obtain
I = X/(c + 1). Common uses of this technique include integrating expressions of
the form $e^{ax} \sin bx$ and $e^{ax} \cos bx$ (apply integration by parts twice, then solve
the equation). The rule Solve equation is applied only to the whole expression,
and takes two parameters: the index id of a previous step and a coefficient
coeff. Let I be the integral before step id. The rule adds coeff · I to the current
expression, then divides by coeff + 1 and simplifies. For example, in the evaluation
of $\int_0^{\pi/2} e^{2x} \cos x\, dx$, after some steps we get $-2 + e^{\pi} - 4 \int_0^{\pi/2} e^{2x} \cos x\, dx$. Then,
applying Solve equation with id = 1 and coeff = 4 yields the answer $\frac{1}{5}(-2 + e^{\pi})$.
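The arithmetic behind the rule is a one-line computation, which can be compared against a direct numerical evaluation of the example. The sketch below assumes the integration limits 0 and π/2 for the example; the helper names are chosen here for illustration.

```python
import math


def solve_for_integral(x_value: float, coeff: float) -> float:
    """Solve I = X - coeff * I for I, which requires coeff != -1."""
    if coeff == -1:
        raise ValueError("the equation I = X + I cannot be solved for I")
    return x_value / (coeff + 1)


def midpoint(func, a: float, b: float, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of func from a to b."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


# After two integrations by parts: I = (-2 + e^pi) - 4 * I.
answer = solve_for_integral(-2.0 + math.exp(math.pi), 4)

# Independent numerical value of the same integral.
numeric = midpoint(lambda x: math.exp(2 * x) * math.cos(x), 0.0, math.pi / 2)
```

Both values are approximately 4.228, matching the closed form (−2 + e^π)/5.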


4 User Interface 


Above the level of representation of a computation, the graphical user interface 
helps the user to specify a computation in several ways. Compared to editing a 
computation directly, the user interface provides the following conveniences: 


² This is valid as long as the integral exists. In our setting this holds as long as the
integrand is continuous. 
– Display of all expressions in LaTeX format.

– Selection of actions and subexpressions to perform the action on.

– Automatic generation of some parameters of steps.

– Access to automatic integration algorithms such as Slagle’s method.


In the remainder of this section, we describe the last two functionalities in
more detail. A screenshot of the user interface is shown in Figure 1.


Fig. 1. Screenshot of the user interface, showing the computation of Example 2 in
Section 6.


4.1 Substitution 


As discussed in Section 3.3, the Substitution I rule requires both f and g as
parameters, while typically only g is specified in informal arguments. Finding the
function f can be a nontrivial process. We try two heuristic methods for finding
f. First, if the substitution u = g(x) can be solved for x, yielding a function
h such that x = h(u), then f can be found by dividing the integrand by g'(x),
then substituting h(u) for x and simplifying. Both solving and simplification can
be done without checking well-definedness of intermediate expressions, since in
the end one only needs f(g(x))g'(x) to equal the integrand. For the implementation,
we use SymPy’s solve function to attempt to find h. The second heuristic
simply replaces all expressions equal to g(x) by u, then hopes that all remaining
occurrences of x are in a single g'(x) in the numerator. Note that the user can
always first rewrite the expression into a form where the second heuristic can be
applied.
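The first heuristic can be sketched with SymPy as follows. This is a minimal illustration, not the tool's actual code: the helper name `find_f` and the example integrand are assumptions made here, and only `solve`, `diff`, `subs` and `simplify` are SymPy functions.

```python
import sympy as sp

x, u = sp.symbols("x u")

def find_f(integrand, g):
    """Hypothetical helper: candidates f(u) with f(g(x)) * g'(x) == integrand."""
    candidates = []
    # Solve u = g(x) for x, giving zero or more inverse functions h(u).
    for h in sp.solve(sp.Eq(u, g), x):
        # Divide by g'(x), substitute x := h(u), then simplify.
        f = sp.simplify((integrand / sp.diff(g, x)).subs(x, h))
        # Keep f only if f(g(x)) * g'(x) really equals the integrand.
        if sp.simplify(f.subs(u, g) * sp.diff(g, x) - integrand) == 0:
            candidates.append(f)
    return candidates

# Illustrative input: integrand 2*(2x+1)^3 with g(x) = 2x + 1.
fs = find_f(2 * (2 * x + 1) ** 3, 2 * x + 1)
```

The final check inside the loop reflects the remark above: intermediate steps need not be well-defined, as long as f(g(x))g'(x) equals the integrand in the end.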


4.2 Rational Functions 


Polynomial division or partial fraction decomposition is a common first step for
integrating rational functions. From the user interface, the user can invoke these
actions. SymPy’s apart method is then used to obtain the results. For example,
starting from the integral of a rational function, the user may choose partial
fraction decomposition from the menu, which turns the integral into one whose
integrand is a sum of simpler fractions. The Rewrite rule with appropriate denom
parameter is generated from this step.
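As a small illustration of this step, the snippet below calls SymPy's `apart` on a rational function. The integrand is an arbitrary choice made for this sketch and is not the example from the text.

```python
import sympy as sp

x = sp.symbols("x")

integrand = 1 / (x * (x + 1))          # illustrative rational function
decomposed = sp.apart(integrand, x)    # partial fraction decomposition

# A rewrite of this kind is only sound if both forms agree as rational functions.
difference = sp.simplify(decomposed - integrand)
```

After the decomposition, each summand can be integrated separately by the basic rules.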


4.3 Trigonometric Identities 


For the application of trigonometric identities, the user does not need to remember
the names of any rules in Fu’s method. Instead, the user selects a subexpression
to rewrite. Then, each of Fu’s rules is applied in turn using SymPy. Whenever the
application of a rule modifies the expression, the new expression is displayed,
and the user can select from the displayed options. The selected action is then
recorded with the corresponding rule name.
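The selection mechanism can be sketched as follows, using the rule functions from SymPy's `sympy.simplify.fu` module. The helper `rewrite_options` and the particular subset of rules are assumptions made for this illustration; the actual tool may try a different set.

```python
import sympy as sp
from sympy.simplify.fu import TR5, TR6, TR9, TR10, TR11

# Illustrative subset of Fu's rules, keyed by the name recorded with the step.
RULES = {"TR5": TR5, "TR6": TR6, "TR9": TR9, "TR10": TR10, "TR11": TR11}

def rewrite_options(subexpr):
    """Apply each rule in turn; keep only the rules that change the expression."""
    options = {}
    for name, rule in RULES.items():
        rewritten = rule(subexpr)
        if rewritten != subexpr:
            options[name] = rewritten
    return options

x = sp.symbols("x")
options = rewrite_options(sp.cos(2 * x))  # TR11 expands the double angle
```

The user would then pick one entry of `options`, and the key of that entry is what gets stored as the rule name of the step.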


4.4 Slagle’s Method 


We implement a heuristic integration method due to Slagle [45]. There are two 
main reasons why we choose Slagle’s method. First, it is simple but effective for 
college-level problems. Second, it can output human-readable reasoning steps. 
This method maintains a search tree consisting of AND-nodes and OR-nodes.
Each node contains an integral, with the root containing the original integral. An
AND-node specifies that the integral at the node would be solved if each of its
child nodes is solved. An OR-node specifies that the integral at the node would
be solved if one of its child nodes is solved. The method iteratively expands the
tree using a list of algorithmic and heuristic rules. Algorithmic rules involve basic
normalization operations such as simplification and polynomial division; they are
always applied to each node. In contrast, heuristic rules are more exploratory,
such as guessing potential expressions for substitution, and count as one step in
the search.
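The AND/OR structure of the search tree can be illustrated with a small data type. This is only a sketch of how solved status propagates: the class and field names are hypothetical, integrals are represented as plain strings, and rule application is left out.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    integral: str                      # the integral stored at this node
    kind: str = "OR"                   # "AND" or "OR"
    children: List["Node"] = field(default_factory=list)
    solved_directly: bool = False      # e.g. closed by a table lookup

    def solved(self) -> bool:
        if self.solved_directly:
            return True
        if not self.children:
            return False
        results = [child.solved() for child in self.children]
        # AND: every child must be solved; OR: one solved child suffices.
        return all(results) if self.kind == "AND" else any(results)

# An OR-node with two alternatives; the first splits a sum into two parts.
split = Node("INT (x + cos x) dx", "AND", [
    Node("INT x dx", solved_directly=True),
    Node("INT cos x dx", solved_directly=True),
])
dead_end = Node("INT (x + cos x) dx after an unhelpful substitution")
root = Node("INT (x + cos x) dx", "OR", [split, dead_end])
```

Here `root.solved()` holds because one of its alternatives, the AND-node, has all of its children solved.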

Our implementation is mostly faithful to the original presentation [45], with
some modifications to fit better with our framework. The output of Slagle’s
method (if successful) is a list of applications of algorithmic and heuristic rules.
Each rule application can then be converted to one or more of the computation
steps described earlier.


5 Proof Translation 


We now describe the process for translating a computation to a proof in higher- 
order logic. This requires sufficiently strong proof automation for verifying the 
application of each integration rule. The main components of the automation 
include showing two expressions are equal by simplification, inequality checking, 
and showing continuity, differentiability, and integrability of functions. The proof 
automation is implemented in Python based on HolPy. However, it should be 
possible to implement it in other proof assistants, and one aim of this section is 
to provide details to facilitate this process. 


5.1 Introduction to HolPy


HolPy [49] is a new system for interactive theorem proving implemented in 
Python. Like Isabelle [32], HOL Light [PI], and HOL4 [I], it uses higher-order 
logic as the logical foundation. The design of HolPy centers around explicit proof 
terms that can be generated and checked as Python objects, and written to a file 
in JSON format. Macros are used pervasively to control the size of proof terms. 
An API for proof automation facilitates implementation of procedures generating 
proof terms, in a manner similar to writing proof automation in the ML family 
of languages, but in the setting of an imperative programming language. 


5.2 Background Library 


For the background library in analysis, we ported statements of over a thousand 
theorems from HOL Light, of which about 40% are proved using the point-and- 
click based user interface [49]. However, major parts of the theory are yet to be 
formalized, including the construction of real numbers, the gauge integral, and 
the fundamental theorem of calculus. At present, the statements of the theorems 
need to be trusted. Finishing the formalization of the analysis library is planned 
as future work. 


5.3 Structure of Proof Automation 


The procedure for translating a computation is as follows. For each step in the 
computation, all expressions involved are first translated into terms in higher- 
order logic. Depending on the rule used, the automation applies the appropriate 
conversion to the input term, with the parameters of the rule serving as addi- 
tional arguments to the conversion. Next, the automation attempts to show the 
equality between the result of the conversion and the expected output of the step 
by simplifying both sides. Hence, there does not need to be perfect agreement
between the expected output and what is computed by proof automation. The
translation is successful as long as proof automation is able to show their equivalence.
In this way, we allow additional flexibility in the implementations.

We now discuss the overall structure of proof automation, which bears some 
similarity to the structure of auto and simp tactics in Isabelle [48]. We maintain 
two tables: a table of proof rules and a table of simplification rules. Each table 
is indexed by the head of the predicate or term the rule expects. There may be 
multiple rules associated to the same head term. 


— A prove rule for a predicate p takes as input a goal whose head is p and 
a list of assumptions, and attempts to prove the goal. A simple way to 
specify a prove rule is from a list of theorems whose conclusion matches the 
given predicate. The corresponding prove rule attempts to apply each of the 
theorems in order. In case a theorem has assumptions, it recursively applies 
the overall prove procedure (described below) to discharge each assumption. 
— A simplification rule for a function f takes as input a term whose head is f
and a list of assumptions, and computes the simplification of the term under 
these assumptions. A simple way to specify a simplification rule is from a list 
of theorems whose conclusion is an equality, where the left side has head f. 
The corresponding simplification rule attempts to rewrite using each of the 
equalities in order. Assumptions in the theorem are discharged by recursive 
calls to prove as in the previous case. 


The overall procedure is defined as a mutual recursion between two functions 
prove and norm. The norm function receives a term and a list of assumptions as 
input. It first recursively applies itself to the subterms of the term. Next, it looks 
for simplification rules associated to the head of the term and applies them in 
turn. If the head changes, the process is repeated. Note that the prove function may
be called to discharge assumptions of rewrite rules. This continues until the term
is not changed by the simplification rules. The prove function takes a goal and a
list of assumptions as input. It first simplifies the goal, then looks for prove rules
associated to the head term and applies each of them in turn. The case where
the goal is an equality reduces to simplifying both sides and then comparing
whether they are the same.
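The mutual recursion between prove and norm can be sketched on a toy term language. Everything below is an illustrative assumption rather than HolPy's API: terms are nested tuples of the form (head, arg1, ..., argn), rules are plain Python functions, and no proof terms are produced.

```python
from typing import Callable, Dict, List

# Terms are atoms (str, int) or tuples (head, arg1, ..., argn).
simp_rules: Dict[str, List[Callable]] = {}    # indexed by head of the term
prove_rules: Dict[str, List[Callable]] = {}   # indexed by head of the goal

def head(t):
    return t[0] if isinstance(t, tuple) else None

def norm(t, assums):
    """Simplify the subterms first, then rewrite at the head until stable."""
    if isinstance(t, tuple):
        t = (t[0],) + tuple(norm(arg, assums) for arg in t[1:])
    for rule in simp_rules.get(head(t), []):
        new = rule(t, assums)
        if new is not None and new != t:
            return norm(new, assums)   # the head may have changed: start over
    return t

def prove(goal, assums):
    """Simplify the goal, then try assumptions, equality, and prove rules."""
    goal = norm(goal, assums)
    if goal in assums:
        return True
    if head(goal) == "eq":             # both sides are already simplified
        return goal[1] == goal[2]
    return any(rule(goal, assums) for rule in prove_rules.get(head(goal), []))

def add_zero(t, assums):               # t + 0 = t
    return t[1] if t[2] == 0 else None

def abs_nonneg(t, assums):             # t >= 0 ==> |t| = t, condition via prove
    return t[1] if prove(("ge", t[1], 0), assums) else None

def ge_literals(goal, assums):         # decide a >= b for integer literals
    _, a, b = goal
    return isinstance(a, int) and isinstance(b, int) and a >= b

simp_rules["add"] = [add_zero]
simp_rules["abs"] = [abs_nonneg]
prove_rules["ge"] = [ge_literals]
```

For instance, `norm(("abs", ("add", "x", 0)), [("ge", "x", 0)])` first simplifies the inner sum to `"x"` and then removes the absolute value, because the side condition is discharged by a recursive call to `prove`.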


5.4 Inequality Checking 


A major task of proof automation is checking inequalities in one variable x
constrained to lie in an interval [a,b] or (a,b). For example, if one wishes to
simplify $\sqrt{f(x)^2}$ to f(x) in the integrand, where the integral is from a to b,
one needs to check f(x) ≥ 0 in the open interval (a,b). Here f may involve the
usual arithmetic operations, as well as logarithm, exponential, and trigonometric
functions.

The general problem of inequality checking is undecidable when special func- 
tions are involved. Hence, we can only hope for methods that can solve most of 
the inequality goals that appear in practice. There are many heuristic methods [7] 
as well as decision procedures for inequalities. For our purposes, we found the 
following, which can be considered as a simplified version of interval arithmetic, 
to be both simple and effective: starting from the assumption that x lies in a
certain interval, iteratively deduce the intervals constraining each of the subterms
in the expression. The derivation for each subterm depends on the head of the 
subterm. Of course, this method is incomplete as it tends to over-approximate 
the intervals of terms formed from binary operators. Implementation of more 
advanced inequality checking methods is a goal for the future. 
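The bottom-up interval propagation can be sketched as below. This is a simplified illustration under several assumptions: expressions are nested tuples, only closed intervals with floating-point endpoints are used, and the set of supported heads (`add`, `mul`, `exp`, `cos`) as well as the function name `bounds` are choices made for this sketch.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]

def bounds(e, x_iv: Interval) -> Interval:
    """Interval containing the values of e when x ranges over x_iv."""
    if e == "x":
        return x_iv
    if isinstance(e, (int, float)):
        return (float(e), float(e))
    op, *args = e
    ivs = [bounds(arg, x_iv) for arg in args]
    if op == "add":
        (a, b), (c, d) = ivs
        return (a + c, b + d)
    if op == "mul":
        (a, b), (c, d) = ivs
        products = [a * c, a * d, b * c, b * d]
        return (min(products), max(products))
    if op == "exp":                     # exp is increasing
        ((a, b),) = ivs
        return (math.exp(a), math.exp(b))
    if op == "cos":                     # crude bound, independent of the argument
        return (-1.0, 1.0)
    raise ValueError(f"unsupported head: {op}")

# x*x + 1 on [0, 1] is bounded below by 1, so it is provably positive.
lo, hi = bounds(("add", ("mul", "x", "x"), 1), (0.0, 1.0))

# Over-approximation: x*x on [-1, 1] gets the interval [-1, 1], not [0, 1].
loose = bounds(("mul", "x", "x"), (-1.0, 1.0))
```

The second call shows the incompleteness mentioned above: the two occurrences of x are treated as independent, so the method cannot conclude that x·x is non-negative on [-1, 1].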


5.5 Simplification 


Simplification for arithmetic operations follows the same principle as described
earlier: expand the expression into polynomial form, but do not expand powers.
We also do not reduce rational functions. This is similar to the normalization of
polynomials in other implementations of proof automation [7].

More precisely, define a monomial to be a term of the form $c \cdot (a_1^{p_1} a_2^{p_2} \cdots a_k^{p_k})$,
where c is a rational number, and each $a_i$ is either a prime number or a term
whose head is not an arithmetic operator. If $a_i$ is a prime number, then the
corresponding $p_i$ must be either non-constant or a rational number between 0
and 1 exclusive. The $a_i$’s are distinct and sorted in a pre-determined order. A
rational number is a special case of a monomial, with k = 0. We call c the
coefficient of a monomial and $a_1^{p_1} a_2^{p_2} \cdots a_k^{p_k}$ its body. A polynomial is a sum of
monomials, whose bodies are all distinct and in sorted order. It is clear that
any expression can be simplified into this form. For example, $\sqrt{6}\sqrt{2}(x + 3^{2/3})$ is
simplified to


$$6^{1/2} 2^{1/2} x + 6^{1/2} 2^{1/2} 3^{2/3} = 2^{1/2} 3^{1/2} 2^{1/2} x + 2^{1/2} 3^{1/2} 2^{1/2} 3^{2/3} = 2 \cdot 3^{1/2} x + 6 \cdot 3^{1/6}.$$


Simplification of polynomials is implemented in the simplification rules for +,
× and power. $a - b$ and $a/b$ are simply reduced to $a + (-1) \cdot b$ and $a \cdot b^{-1}$,
respectively.
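As a sanity check on the normal form, the expression $\sqrt{6}\sqrt{2}(x + 3^{2/3})$ can be compared with the polynomial $2 \cdot 3^{1/2} x + 6 \cdot 3^{1/6}$ using SymPy. This only confirms that the two expressions agree numerically; it is not the simplification procedure of the paper, and SymPy's own output may group or order terms differently.

```python
import sympy as sp

x = sp.symbols("x")

original = sp.sqrt(6) * sp.sqrt(2) * (x + 3 ** sp.Rational(2, 3))
normal_form = 2 * sp.sqrt(3) * x + 6 * 3 ** sp.Rational(1, 6)

# Compare the two at an arbitrary rational sample point.
gap = sp.N((sp.expand(original) - normal_form).subs(x, sp.Rational(17, 10)))
```

In the normal form, the prime 3 appears only with exponents 1/2 and 1/6, both strictly between 0 and 1, as required by the definition of a monomial.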

For logarithms and exponentials, we apply the standard simplification rules
$\log 1 = 0$, $\log(e^x) = x$, $e^0 = 1$, and $x > 0 \longrightarrow e^{\log x} = x$. Simplifying trigonometric
functions applied to special values is trickier, as we may need to add or subtract
multiples of π. For example, the cosine of a special angle outside the standard range
is first rewritten to the cosine of the reduced angle and then to its exact value.

When simplifying an integral over the closed interval [a,b], we apply the
following congruence rule:


$$\forall x \in (a,b).\ f(x) = g(x) \Longrightarrow \int_a^b f(x)\,dx = \int_a^b g(x)\,dx.$$


This allows us to assume $x \in (a,b)$ when simplifying f(x).


5.6 Applying Theorems 


For proving continuity and differentiability, we set up the corresponding prove 
rules using lists of introduction rules. Some of these rules require assumptions 
that are discharged recursively. For example, the introduction rule for division 
is as follows: 


[continuous_on S f, continuous_on S g, ∀x ∈ S. g(x) ≠ 0]
⟹ continuous_on S (λx. f(x)/g(x))

Application of this rule involves recursively proving the three assumptions,
including the use of inequality checking from Section 5.4.

Substitution and integration by parts are implemented by applying the
corresponding theorems. This is simple because the parameters of the rule already
contain instantiations for all function variables.


6 Evaluation and Examples 


We evaluated our prototype implementation³ on problems taken from exam
preparation books (Tongji), online problem lists by D. Kouba [27] (Kouba) and
the MIT Integration Bee [2] (MIT). We also compared our results with Maple
and WolframAlpha. Statistics from the evaluation are shown in Table 1.


³ The code and examples are available online at https://github.com/bzhan/

Problem set           | Total | Solved | Ratio | Slagle | Ratio | Maple | WolframAlpha
Tongji                |    36 |     36 |  100% |     26 |   72% |    32 |           35
Kouba/Substitution    |    18 |     17 |   94% |     13 |   72% |    18 |           18
Kouba/Exponentials    |    12 |      7 |   58% |      7 |   58% |    12 |           11
Kouba/Trigonometric   |    27 |     22 |   81% |     11 |   41% |    18 |           22
Kouba/ByParts         |    23 |     22 |   96% |     17 |   74% |    23 |           23
Kouba/LogArcTangent   |    22 |     21 |   95% |     13 |   59% |    21 |           21
Kouba/PartialFraction |    20 |     16 |   80% |      8 |   40% |    18 |           20
MIT/2013              |    25 |     20 |   80% |     14 |   56% |    20 |           24
Total                 |   183 |    161 |   88% |    109 |   60% |   162 |          174

Table 1. Statistics on the problem lists. “Solved” indicates the number of problems
for which proofs can be successfully reconstructed from human-provided computations.
“Slagle” indicates the number of problems that can be solved by Slagle’s method, with
successful proof reconstruction. “Maple” represents the number of problems solved by
Maple. “WolframAlpha” represents the number of problems for which WolframAlpha can
give step-by-step solutions without exceeding its time limit.


The Kouba problem lists are divided into different categories based on the
techniques used. With human-provided computation steps, we can reconstruct proofs
for all of the Tongji problems and most of the problems in D. Kouba’s list, while
problems from the MIT Integration Bee are more challenging (with the later
years increasing in difficulty). Most of the failures are due to the inability to show
equality after simplification or to failures during inequality checking. Some are
due to unsupported functions.

We show two interesting examples from our case studies. SymPy (version 1.5) 
returns a wrong answer on the first example and times out on the second. The 
second example takes a long time even for Mathematica, and cannot be solved by 
its online version WolframAlpha. These examples demonstrate that our system 
avoids the common errors, and since the user can guide the computation step- 
by-step, is also able to verify integrals that are difficult even for sophisticated 
computer algebra systems. 

The first example (Tongji, #27) demonstrates the splitting of the domain of
integration, as well as the use of trigonometric identities. The integral is


$$\int_0^{\pi} \sqrt{1 + \cos 2x}\,dx.$$


This integral is incorrectly evaluated by SymPy as 0. It is correctly evaluated
by Mathematica almost instantly.

The evaluation begins with application of trigonometric identities, rewriting
the integrand to $\sqrt{1 + \cos^2 x - \sin^2 x}$ and then to $\sqrt{2\cos^2 x}$. For this, the user
simply needs to select $\cos 2x$ and then $\sin^2 x$, and choose the desired rewrite
targets. The resulting situation is similar to the example given in the introduction.
It is then necessary to split the domain of integration where $\cos x = 0$. The
system is able to automatically determine $x = \frac{\pi}{2}$. The full computation is:
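
\begin{align*}
I &= \int_0^{\pi} \sqrt{1 + \cos^2 x - \sin^2 x}\,dx && \text{(Rewrite trig. rule TR11)} \\
  &= \int_0^{\pi} \sqrt{2\cos^2 x}\,dx = \sqrt{2}\int_0^{\pi} |\cos x|\,dx && \text{(Rewrite trig. rule TR5, Simplification)} \\
  &= \sqrt{2}\left(\int_0^{\pi/2} |\cos x|\,dx + \int_{\pi/2}^{\pi} |\cos x|\,dx\right) && \text{(Split region with } c = \tfrac{\pi}{2}\text{)} \\
  &= 2\sqrt{2} && \text{(Elim absolute value, Simplification)}
\end{align*}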
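The value $2\sqrt{2}$ for $\int_0^{\pi} \sqrt{1 + \cos 2x}\,dx$ can be cross-checked numerically. The sketch below uses a hand-written composite Simpson rule (the helper `simpson` is defined here and is not part of the tool), and also shows how ignoring the absolute value leads to the incorrect result 0.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def integrand(x):
    # max(...) guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 + math.cos(2.0 * x)))

# Split at pi/2, where cos x changes sign and the integrand has a kink.
numeric = (simpson(integrand, 0.0, math.pi / 2)
           + simpson(integrand, math.pi / 2, math.pi))
expected = 2.0 * math.sqrt(2.0)

# Treating sqrt(2 cos^2 x) as sqrt(2) cos x on the whole interval gives the
# antiderivative sqrt(2) sin x, whose difference over [0, pi] is 0.
naive = math.sqrt(2.0) * (math.sin(math.pi) - math.sin(0.0))
```

The numerical value agrees with $2\sqrt{2}$, while the naive evaluation is zero up to rounding, which is the kind of error the inequality side conditions are meant to rule out.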


The second example comes from MIT Integration Bee 2019, problem #14:

$$\int_0^{\pi/100} \frac{\sin(20x) + \sin(19x)}{\cos(20x) + \cos(19x)}\,dx.$$

It is simple if one notices to apply the sum-to-product identity first, but almost
impossible otherwise. WolframAlpha fails to find the symbolic answer. Using
Mathematica offline, it takes about 15 seconds to return an answer, which is
however much more complicated than necessary.
The full computation using our tool is:


\begin{align*}
I &= \int_0^{\pi/100} \frac{\sin\left(\frac{39}{2}x\right)}{\cos\left(\frac{39}{2}x\right)}\,dx && \text{(Rewrite trigonometric, rule TR9)} \\
  &= \int_1^{\cos\left(\frac{39}{200}\pi\right)} -\frac{2}{39}\cdot\frac{1}{t}\,dt && \text{(Substitution I with } g = \cos\left(\tfrac{39}{2}x\right)\text{)} \\
  &= -\frac{2}{39}\log\left(\cos\left(\tfrac{39}{200}\pi\right)\right) && \text{(Simplification).}
\end{align*}
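The closed form $-\frac{2}{39}\log\cos\frac{39\pi}{200}$ for the integral of $(\sin 20x + \sin 19x)/(\cos 20x + \cos 19x)$ over $[0, \pi/100]$ can likewise be cross-checked numerically. The `simpson` helper below is a hand-written quadrature rule used only for this check, and the sample point `x0` is arbitrary.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def integrand(x):
    return ((math.sin(20 * x) + math.sin(19 * x))
            / (math.cos(20 * x) + math.cos(19 * x)))

numeric = simpson(integrand, 0.0, math.pi / 100)
closed_form = -(2.0 / 39.0) * math.log(math.cos(39.0 * math.pi / 200.0))

# Sum-to-product: on this interval the integrand equals tan(39x/2).
x0 = 0.02
identity_gap = abs(integrand(x0) - math.tan(39.0 * x0 / 2.0))
```

Both checks pass to within floating-point accuracy, which is consistent with the sum-to-product rewrite being the key step.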


7 Conclusion 


In this paper, we proposed a framework for verifying symbolic computation 
of definite integrals, where the user can perform computations in an interface 
familiar from computer algebra systems, but with results verified by automatic 
translation to proofs in higher-order logic. The design of the framework follows a 
layered approach, with each layer focusing on a different aspect of the problem: 
methods for solving integrals, computer algebra, and proof reconstruction. We 
implemented a prototype system based on HolPy, and evaluated it on a test 
suite consisting of publicly available problem lists at the undergraduate level, 
showing its effectiveness on a large majority of cases. 

One immediate piece of future work is to secure the foundation of the higher- 
order logic proof, by formalizing the proofs of the required theorems. Another 
gap is the arithmetic computation and comparison of real constants, which, in 
the case of comparisons, would require approximation techniques [10]. 

Our prototype implementation focuses on definite integrals of one-variable 
functions. However, the idea can be applied more generally, by suitably extending 
the language of integration rules. For applications in the engineering domain, 
some extensions that would be of high value include linear algebra, improper 
integrals (including Laplace and Fourier transforms), and vector calculus. 


Abstract. Commonsense reasoning has long been considered one of the
holy grails of artificial intelligence. Our goal is to develop a logic-based 
component for hybrid — machine learning plus logic — commonsense ques- 
tion answering systems. A critical feature for the component is estimating 
the confidence in the statements derived from knowledge bases contain- 
ing uncertain contrary and supporting evidence obtained from different 
sources. Instead of computing exact probabilities or designing a new cal- 
culus we focus on extending the methods and algorithms used by the 
existing automated reasoners for full classical first-order logic. The pa- 
per presents the CONFER framework and implementation for confidence 
estimation of derived answers. 


1 Introduction 


The mainstream approaches for “commonsense reasoning” (CSR) before this 
century focused on rule based reasoning and building suitable logical systems. 
During the last ten years the focus has switched to machine learning and neural 
networks. Both of these approaches appear to be limited. A promising approach 
to practical question answering is building hybrid systems like Watson [17] which 
complement the current machine learning systems for natural language with 
logic-based reasoning systems specialized for CSR. In particular, hybrid systems 
have a good potential for progress towards explainable A.I. See Marcus [26] for 
an overview of the current work in the area. Our goal is to build upon the existing 
theory and reasoning systems for first order logic (FOL) to develop a framework 
and practical systems using FOL reasoners which could be incorporated into 
a hybrid system containing both machine learning components and rule-based 
reasoning components. This approach will also provide step-by-step proofs for 
the answers found, useful for building explainable systems. 

We will present the design and implementation of the CONFER framework 
for extending existing automated reasoning systems with confidence calculation 
capabilities. We will not focus on other, arguably even more critical issues for 
CSR and question answering, like handling natural language itself, dialogues, 
rules with exceptions and default logic [31] or circumscription, knowledge
representation for space/time, epistemic reasoning, using context, building and
collecting suitable rules, machine learning etc.

The specific CSR task targeted by the current paper is question answering: 
given either a knowledge base of facts and rules or a large corpus of texts (or
both), plus optionally a situation description (assumptions) for the questions, 
answer questions posed either in logic or natural language. 

Historically, the longest-going CSR project has been the logic-based CYC 
project [25], already in 1985 stating the focus on CSR. Despite several successes, 
the approach taken in the CYC project has often been viewed as problematic ([8], 
[10]) and has been repeatedly used as an argument against logic-based methods 
in CSR. Beltagy et al [5] experiment with Markov Logic Network for combining 
logical and distributional representations of natural language meaning. Domin- 
gos et al note in [13] that the CYC project has used Markov Logic for making 
a part of their knowledge base probabilistic. Khot et al [24] experiment with 
Markov Logic Networks for NLP question answering. Furbach et al [20] describe 
research and experiments with a system for natural language question answer- 
ing, converting natural language sentences to logic and then performing proof 
search, using different existing FOL knowledge bases. The authors note a num- 
ber of difficulties, with the most crucial being the lack of sufficiently rich FOL 
knowledge bases. The closest current approach to ours appears to be the Braid 
system [23] built by the team previously involved with the Watson system. 


2 Interpretation and Encoding of Uncertainty 


Reasoning under uncertainty has been thoroughly investigated for at least a 
century, leading to a proliferation of different theories and mechanisms. A classic 
example is the MYCIN system [6]. For newer approaches see, for example, [32] 
and [9]. Each of these is well suited for certain kinds of problems and ill-suited 
for other kinds. Underlying this is the philosophical complexity of interpreting 
probability: see [22] for an overview, see also [16], pp. 5-7. 

Most of the previous work on combining logic with uncertainty has targeted
propositional logic. First order logic is then handled by creating a finite set of
weighted ground instances of formulas. This is the approach taken, for example,
by the probabilistic logic programming systems ProbLog2 [18], PRISM [34] and
the implementation of Markov Logic Networks [12,11] by the Alchemy 2 system
[1]. These systems impose different restrictions on the FOL formulas and, while
well-suited for small domains in cases where the restrictions can be followed, the
approach becomes unfeasible if the domain is large or the formulas are complex.
For example, neither the ProbLog2 nor the Alchemy 2 implementation manages to
answer queries like

1.0::p(a). 1.0::p(i(a,b)). 1.0::p(Y) :- p(X), p(i(X,Y)). query(p(b)).

The implementation of ProbLog2 [29] fails, presumably due to
infinite recursion in searching for possible groundings for the variables, while
Alchemy 2 does not allow function terms in grounded facts.

Previous approaches to full first order logic tend to fall into one of three
camps: either using fuzzy logic [41], representing probabilities as an interval
(see [15] for the axiomatic derivation of Dempster-Shafer rules) or interpreting
probabilities via many worlds similarly to modalities [4].

For the sake of this work, we largely follow the subjective interpretation of 
probability as a degree of belief, originating from Ramsey and De Finetti. We 
use the word confidence to denote our rough adherence to this interpretation. We 
avoid using complex measures such as intervals, distributions or fuzzy functions. 

In the context of question answering we assume that confidences are typically 
used for sorting a list of candidate answers by their calculated confidence and 
optionally applying a filter to eliminate answers with a confidence under a certain 
threshold. Answers provided may also be annotated with a confidence number. If
we are given or can calculate several different confidences for the same answer, 
we always prefer the higher confidence. The question of calculating a correct 
probability rarely arises or is considered to be unfeasible. 
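The intended use of confidences described above (prefer the higher confidence for the same answer, filter by a threshold, sort) can be sketched as follows. The function name `rank_answers` and the default threshold are assumptions made for this illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

def rank_answers(candidates: Iterable[Tuple[Hashable, float]],
                 threshold: float = 0.0) -> List[Tuple[Hashable, float]]:
    """Keep the best confidence per answer, drop weak answers, sort descending."""
    best: Dict[Hashable, float] = {}
    for answer, confidence in candidates:
        # Several confidences for the same answer: prefer the higher one.
        if confidence > best.get(answer, float("-inf")):
            best[answer] = confidence
    kept = [(a, c) for a, c in best.items() if c >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

ranked = rank_answers([("a", 0.4), ("b", 0.9), ("a", 0.7), ("c", 0.1)],
                      threshold=0.3)
```

Here `"a"` keeps its higher confidence 0.7, `"c"` falls below the threshold, and `"b"` is ranked first.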


2.1 Sources, Representation and Meaning of Statements, 
Confidences and Dependencies 


We assume that the confidence in a fact or rule in our common sense knowledge
base (KB in the following) typically arises from a large number of human users
via crowd-sourcing as in ConceptNet [35,7], from NLP-analyzed text scraped from
the web as in NELL [27], and/or from combining different knowledge bases with
weights as in [14] and [7], or is assigned to the equivalence of name pairs in the
vocabulary as in [28] and [19]. There is recent progress towards making knowledge
bases for common sense reasoning where the relation strengths (typicality, saliency)
have been empirically evaluated [7,33].

To each FOL statement S we will assign both a confidence c and a set L 
of unique identifiers of (non-derived) input statements used for deriving this 
statement: a triple (S,c, L). Lists of such triples are then treated as sets. The 
dependency lists L are used in the formula estimating the cumulated confidence. 
The algorithm for calculating confidences c for derivations will be presented 
later. 

To be more exact, we will not allow assigning confidences to arbitrary state- 
ments. Instead, we will assume that the FOL statements are converted to a 
conjunctive normal form: a conjunction of Skolemized disjunctions, where each 
disjunction only consists of atomic statements (a predicate applied to arguments) 
or negations of atomic statements. Such disjunctions are called clauses. We will 
not allow nested triples, i.e. S is always a pure FOL clause not containing any 
confidence or dependency information usable by the presented algorithms. How- 
ever, for each single FOL clause S there may be many different derivable triples 
(S, c, L) for different c and L, stemming from different derivation trees of S. They 
are assumed to be independent statements, possibly allowing the calculation of
a cumulative confidence for S higher than max(c, c′), where c and c′ come from
different triples.

A KB may contain logical contradictions and identical FOL clauses with
different confidences given by different sources. For example, the following is
a logically contradictory KB containing several copies of the same clause with
different confidences. The CONFER algorithm presented later gives us the
confidence of bird(a) : 0.682 from this KB:


(bird(X), 0.1, L1), (bird(a), 0.8, L2), (bird(a), 0.9, L3), (¬bird(a), 0.3, L4)


We interpret the confidence as estimating the lower limit of the probability
of a statement, i.e., (S, c, L) is interpreted as “statements L support the claim
that probability(S) ≥ c”. Thus two different confidence statements for the same
clause are never contradictory, even if given by the same source.


3 The CONFER Extension Framework for CSR 


In the following we will present the CONFER framework of extensions to the 
mainstream resolution-based search methods. We expect that the same frame- 
work can be adapted to search methods different from resolution, i.e. the specific 
aspects of resolution are not relevant for the main principles of the approach. 

The intuition behind CONFER is preserving first order classical logic (FOL) 
intact as an underlying machinery for derivations in CSR. The core methods of 
automated reasoning used by most of the high-performance automated reason- 
ing systems remain usable as core methods for CSR. Essentially, FOL with the 
resolution method produces all combinations of derivable sentences (modulo sim- 
plifications like subsumption) which could lead to a proof. The main difference 
between strict FOL and CONFER extensions is in the handling of constructed 
proof trees: the outcome of a CONFER reasoner is a set of combined FOL proofs 
with the confidence measures added. 

Importantly, the framework does not generally calculate the exact maximal 
confidence for derived statements, since this is, in nontrivial cases, either im- 
possible or unfeasible. Our goal is to give a practically useful estimation of the 
maximal confidence without causing a large overhead on the FOL proof search 
and avoiding combinatorial explosion while calculating the confidences. 


3.1 Resolution Method 


In the following we will assume that the underlying first order reasoner uses the 
resolution method, see [3] for details. The rest of the paper assumes familiarity 
with the basic concepts, terminology and algorithms of the resolution method. 


3.2 Queries and Answers 


We assume the question posed is in one of two forms: (1) Is the statement Q 
true? (2) Find values V for existentially bound variables in Q so that Q is true. 
For simplicity’s sake we will assume that the statement Q is in the prefix form, 
i.e., no quantifiers occur in the scope of other logical connectives. 

In the second case, it could be that several different value vectors can be
assigned to the variables, essentially giving different answers. We also note that 
an answer could be a disjunction, giving possible options instead of a single 
definite answer. However, as shown in [38], in case a single definite answer exists, 
it will be derived eventually. 

A widely used machinery in resolution-based theorem provers for extracting 
values of existentially bound variables in Q is to use a special answer predicate, 
converting a question statement Q to a formula 


∃X1, ..., Xn (Q(X1, ..., Xn) & ¬answer(X1, ..., Xn))


for existentially quantified variables in Q [21]. Whenever a clause is derived which 
consists of only answer predicates, it is treated as a contradiction (essentially, 
answer) and the arguments of the answer predicate are returned as the values 
looked for. A common convention is to call such clauses answer clauses. We will 
require that the proof search does not stop whenever an answer clause is found, 
but will continue to look for new answer clauses until a predetermined time limit 
is reached. See [37] for a framework of extracting multiple answers. 

We also assume that queries take a general form (KB & A) → Q where KB is
a commonsense knowledge base, A is an optional set of precondition statements 
for this particular question and Q is a question statement. 

Since we assume the use of the resolution method for proof search, the whole 
general query form is negated and converted to clauses, i.e., disjunctions of lit- 
erals (positive or negative atoms). We will call the clauses stemming from the 
question statement question clauses. 


3.3 Top Level of the Algorithm 


Calculating confidences for question answering requires, at least, the ability to 
calculate (a) the decreasing confidence of a conjunction of clauses as performed 
by the resolution and paramodulation rule, (b) the increasing confidence of a 
disjunction of clauses for cumulating evidence, (c) the decreasing confidence of 
considering negative evidence for a clause. 

While the systems based on, say, Bayes networks and Markov logic, perform 
these operations in a combined manner, our framework will split the whole search 
into separate phases for each. First we perform a modified resolution search we 
call c-resolution calculating the decreasing confidence and potentially giving a 
large number of different answers and proofs. Next we will combine the different 
proofs using the cumulation operation. Finally we will collect negative evidence 
for all the answers obtained so far, separately for each individual answer. The 
latter search is also split into the c-resolution phase and the cumulating phase. 
Since we assume the use of full FOL, the c-resolution search will not necessar- 
ily terminate, thus we will use a time limit. The top level of the algorithm is 
presented in the following section as Algorithm 1. 
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Algorithm 1 CONFER algorithm 


Input: Common sense knowledge base KB, question Q, time limit t. 
Output: Set of answers R with attached confidences. 

1: Let R = {}.
2: Find a set of initial positive answers with confidences and dependencies
   IPA = {(A_1, c_1, L_1), ..., (A_p, c_p, L_p)} for Q from KB using c-resolution
   with the time limit t/2.
3: Calculate a set of cumulative positive answers
   CPA = {(B_1, d_1, E_1), ..., (B_r, d_r, E_r)} from IPA.
4: Let i = 1.
5: while i <= r do
6:     Form the negated question NQ_i from ¬Q with a substitution s given by B_i.
7:     Find a set of initial negative answers N_i with confidences and dependencies
       for NQ_i from KB using c-resolution with the time limit t/(2 × r).
8:     if N_i is empty then
9:         Let nc_i = 0.
10:    else
11:        Calculate the cumulative negative confidence nc_i from N_i.
12:    end if
13:    Add a pair (B_i, (c_i - nc_i)) to R.
14:    Let i = i + 1.
15: end while
16: For each pair (B_i, d_i), (B_j, d_j) in R where i ≠ j, B_i = B_j and d_i >= d_j,
    remove (B_j, d_j) from R.
17: Remove from R all elements (B_i, d_i) where d_i <= 0.
18: return the set of answers with confidences R.
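
The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch only, not the gkc-based implementation: the callables find_answers, cumulate and negate_instance are hypothetical stand-ins for c-resolution, the cumulation operation and the instantiation of the negated question, and taking the maximum over the cumulated negative answers is a simplifying assumption of the sketch.

```python
from typing import Callable, Dict, Hashable, List, Tuple

# An answer triple: (answer substitution, confidence, dependency set).
Answer = Tuple[Hashable, float, frozenset]


def confer_top_level(
    find_answers: Callable[[object, float], List[Answer]],
    cumulate: Callable[[List[Answer]], List[Answer]],
    negate_instance: Callable[[Hashable], object],
    question: object,
    time_limit: float,
) -> Dict[Hashable, float]:
    """Hypothetical driver mirroring the phases of Algorithm 1."""
    # Step 2: positive answers, using half of the time budget.
    initial_positive = find_answers(question, time_limit / 2)
    # Step 3: merge different proofs of the same answer.
    cumulative_positive = cumulate(initial_positive)
    result: Dict[Hashable, float] = {}
    if not cumulative_positive:
        return result
    # Steps 5-15: the remaining half is split evenly over the answers.
    per_answer_limit = time_limit / (2 * len(cumulative_positive))
    for answer, pos_conf, _deps in cumulative_positive:
        negatives = find_answers(negate_instance(answer), per_answer_limit)
        neg_conf = 0.0
        if negatives:
            # Assumption: use the strongest cumulated negative answer.
            neg_conf = max(conf for _, conf, _ in cumulate(negatives))
        final = pos_conf - neg_conf
        # Steps 16-17: keep the best value per answer, drop non-positive ones.
        if final > 0 and final > result.get(answer, 0.0):
            result[answer] = final
    return result
```

Passing the search and cumulation procedures as parameters keeps the sketch independent of any particular prover; only the time-budget split (t/2, then t/(2 × r)) and the final filtering are taken from the algorithm text.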


3.4 C-Resolution 


The core part of the algorithm described above is c-resolution: a relatively sim- 
ple modification of the resolution method calculating and keeping track of the 
(multiplied) confidences of premisses of each step along with the union of their 
dependencies. 


Definition 1 (C-Resolution). A modification of the resolution method com- 
puting an ever-increasing set of different proofs for different answers (substi- 
tutions to the question clauses) while employing the relevance filter (definition 
2), performing basic confidence calculation for resolution steps (definition 3), 
assigning the union of the dependency lists of premisses to each derived clause, 
restricting subsumption to c-subsumption (definition 5) and restricting simplifi- 
cation steps according to c-subsumption. 


Inconsistencies. A KB with a nontrivial structure may contain inconsistencies 
in the sense that a contradiction can be derived from the KB. Looking at existing 
KBs mentioned earlier, we observe that they either are already inconsistent (for 
example, the largest FOL version of OpenCyc [30] in TPTP [40] is inconsistent) 
or would become inconsistent in case intuitively valid inequalities are added,
for example, inequalities of classes such as “a cat is not a dog”, “a male is
not a female” or default rules such as “birds can fly”, “dead birds cannot fly”, 
“penguins cannot fly”. We note that several large existing KBs do not contain 
such inequalities explicitly, although they are necessary for nontrivial question 
answering under the open-world assumption. 

Since classical FOL allows anything to be derived from a contradiction, it is
clearly unsuitable for a large subset of KBs. Two possible ways of overcoming
this issue are: (a) using some version of relevance logic or other paraconsistent 
logics or (b) defining a filter for eliminating irrelevant classical proofs. We argue 
that despite a lot of theoretical work in the area, only little work has been done 
in automated proving for relevance logic, thus using it directly is likely to create 
significant complexities. Instead, we introduce a simple relevance filter: 


Definition 2 (Relevance Filter). Each resolution derivation of a contradic- 
tion not containing any answer clauses is discarded. 


Since a standard resolution derivation of a contradiction does not lead to any 
further derivations, this filter is completeness-preserving in the sense that all 
resolution derivations containing an answer clause are still found. 


Confidences of Derived Clauses. We take the approach of (a) providing a 
simple sensible baseline algorithm for calculating confidences of derived clauses, 
and (b) leaving open ways to modify this algorithm for specific cases as need 
arises. We will use a single rational number in the range 0...1 as a measure of a 
confidence of a clause, with 1 standing for perfect confidence and 0 standing for 
no information. Confidence of an atomic clause not holding is represented as a 
confidence of the negation of the clause. 

As a baseline we use the standard approach of computing uncertainties of 
clauses derived from independent parent clauses A and B as: 


$$P(A \wedge B) = P(A) \cdot P(B)$$


Notice that for dependent parent clauses this formula under-estimates the con- 
fidence of the result. 


Definition 3 (Basic Confidence Calculation for Resolution Steps). For 
binary resolution and paramodulation steps, the confidence of a result is obtained 
by multiplying the confidences of the premises. For the factorization step, the 
confidence of the result is the confidence of the premise, unchanged. Question 
clauses have a confidence 1. 


A simple example employing forward reasoning (concretely, negative ordered 
resolution): 


0.8:: bird(tweety).
0.9:: bird(X) => canfly(X).
0.7:: canfly(X) => fast(X).
1.0:: fast(X) => answer(X).


leads to a sequential derivation of


0.72:: canfly(tweety). 
0.504:: fast(tweety). 
0.504:: answer(tweety). 
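
The arithmetic behind this derivation can be checked with a few lines of Python. The sketch below assumes input confidences of 0.8 for the fact, 0.9 and 0.7 for the two rules and 1.0 for the question clause; these inputs are an assumption chosen to be consistent with the derived values 0.72 and 0.504, and the helper name derived_confidence is hypothetical.

```python
from functools import reduce
from operator import mul


def derived_confidence(premise_confidences):
    """Basic calculation: multiply the confidences of all premises."""
    return reduce(mul, premise_confidences, 1.0)


# Assumed input confidences (see the note above).
canfly = derived_confidence([0.8, 0.9])    # bird(tweety), bird => canfly
fast = derived_confidence([canfly, 0.7])   # canfly => fast
answer = derived_confidence([fast, 1.0])   # question clause has confidence 1

print(round(canfly, 3), round(fast, 3), round(answer, 3))
```

Each step can only keep or lower the confidence, which is why long derivations tend towards small values under the basic calculation.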


Recall that the confidences are assumed to be lower bounds of probabili- 
ties. Notice that the possible dependence of the premises could be taken into 
account, as in the following section for cumulative evidence. This would result 
in higher confidence numbers for derivations with dependent premises. Consider 
the following example: 


0.9:: bird(X) => canfly(X). 
0.1:: -bird(X) => canfly(X).


Using the basic calculation step we can derive that anything can fly:


0.09:: canfly(X).


However, since anything is either a bird or is not a bird, the
confidence of canfly(X) should be at least 0.1, and possibly higher, depending
on the ratio of birds to non-birds.

Generally, we can use the minimization operation leading to a higher confi- 
dence value than the multiplication of the confidences of premises in the follow- 
ing special case. The standard resolution inference rule used by a large class of 
automated reasoners is defined as 


$$\frac{A_1 \vee A_2 \vee \ldots \vee A_n \qquad \neg B_1 \vee B_2 \vee \ldots \vee B_m}{(A_2 \vee \ldots \vee A_n \vee B_2 \vee \ldots \vee B_m)\sigma}$$


where $\sigma$ is the most general unifier of $A_1$ and $B_1$. A clause A subsumes a clause
B if the literals of $A\delta$ are a subset of the literals of B for some substitution $\delta$.


Definition 4 (Extended Confidence Calculation for Resolution Steps).
If $(A_2 \vee \ldots \vee A_n)\sigma$ subsumes $(B_2 \vee \ldots \vee B_m)\sigma$ in the resolution inference defined
above then the confidence of the result is the minimum of the confidences of
premises.
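
The difference between the basic and the extended calculation can be illustrated with a small Python sketch. It is not taken from the implementation: the function name step_confidence is hypothetical, and subsumption between the side literals is approximated by plain set inclusion, which is exact only when no further substitution is needed.

```python
from typing import FrozenSet

Literals = FrozenSet[str]


def step_confidence(c1: float, c2: float,
                    rest1: Literals, rest2: Literals) -> float:
    """Confidence of a binary resolvent from premise confidences c1 and c2.

    rest1 and rest2 are the side literals left over from each premise after
    removing the resolved-upon literals and applying the unifier.
    Assumption: subsumption is approximated by set inclusion.
    """
    if rest1 <= rest2:
        return min(c1, c2)   # extended calculation (Definition 4)
    return c1 * c2           # basic calculation (Definition 3)


# Resolving the two canfly rules on bird(X) leaves canfly(X) on both sides.
side = frozenset({"canfly(X)"})
extended = step_confidence(0.9, 0.1, side, side)
# With unrelated side literals the basic product is used instead.
basic = step_confidence(0.9, 0.1, side, frozenset({"other(X)"}))
print(extended, round(basic, 2))
```

For the bird example the extended rule yields 0.1 instead of 0.09, which agrees with the lower bound argued for in the text.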


C-Subsumption and Simplifications. Since standard subsumption used by 
resolution provers to clean up search space may remove clauses with a higher 
confidence or fewer dependencies than the subsuming clause, it may cause the 
prover to lose derivations potentially leading to a higher confidence. Thus we 
use c-subsumption instead of the standard subsumption: 


Definition 5 (C-Subsumption). A triple $T_1 = (A_1, c_1, L_1)$ consisting
of a clause $A_1$, confidence $c_1$ and a dependency list $L_1$ c-subsumes a triple
$T_2 = (A_2, c_2, L_2)$ if and only if $A_1$ subsumes $A_2$, $c_1 \geq c_2$ and $L_1 \subseteq L_2$.
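
A minimal Python sketch of the c-subsumption test is given below. It makes two assumptions that are not part of the definition: clauses are ground, so that ordinary subsumption reduces to set inclusion, and the confidence condition is read as non-strict. The names Triple, subsumes_ground and c_subsumes are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Triple(NamedTuple):
    clause: FrozenSet[str]   # literals of a ground clause
    confidence: float
    deps: FrozenSet[int]     # identifiers of the input clauses used


def subsumes_ground(a: FrozenSet[str], b: FrozenSet[str]) -> bool:
    """Ground subsumption: every literal of a occurs in b."""
    return a <= b


def c_subsumes(t1: Triple, t2: Triple) -> bool:
    """t1 may replace t2 only if it is at least as confident and
    depends on no input clause that t2 does not depend on."""
    return (subsumes_ground(t1.clause, t2.clause)
            and t1.confidence >= t2.confidence
            and t1.deps <= t2.deps)


general = Triple(frozenset({"p(a)"}), 0.9, frozenset({1}))
weaker = Triple(frozenset({"p(a)", "q(a)"}), 0.8, frozenset({1, 2}))
stronger = Triple(frozenset({"p(a)", "q(a)"}), 0.95, frozenset({1, 2}))
print(c_subsumes(general, weaker), c_subsumes(general, stronger))
```

The second call returns False although the clause is subsumed in the ordinary sense: discarding the more confident triple could lose a derivation with a higher final confidence.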


We can prove the following lemma: 


Lemma 1 (C-Subsumption Preserves Completeness). When a 
c-resolution proof can be found without using subsumption, it can be also found 
with c-subsumption.

The proof holds for strategies of resolution for which standard subsumption is
complete for ordinary proof search without confidences. 

We restrict the simplification operations like demodulation and subsuming 
resolution accordingly: a derivation step must keep the original premiss P if the 
result has a lower confidence or a longer list of dependencies than P. 


3.5 Cumulative Confidence 


We will now look at the situation with additional evidence for the derived answer.
In our context, using additional evidence is possible if a clause C can be derived
in different ways, giving two different derivations $d_1$ and $d_2$ with confidences $c_1$
and $c_2$. In case the derivations $d_1$ and $d_2$ are independent, we could apply the
standard formula


$$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$


to $c_1$ and $c_2$ to calculate the cumulative confidence for C.

What would it mean for derivations to be “independent”? In the context
of commonsense reasoning we cannot expect to have an exact measure of
independence. However, suppose the derivations $d_1$ and $d_2$ consist of exactly the
same initial clauses, but used in a different order. In this case $c_1 = c_2$ and the
cumulative confidence should intuitively also be just $c_1$: no additional evidence
is provided. On the other hand, in case the non-question input clauses of
$d_1$ and $d_2$ are mutually disjoint, the derivations are also independent
(assuming all the input clauses are mutually independent), and we should apply
the previous rule for $P(A \vee B)$ for computing the cumulative confidence.

We will estimate the independence i of two derivations $d_1$ and $d_2$ simply as


$$i = 1 - \frac{\text{number of shared input clauses of } d_1 \text{ and } d_2}{\text{total number of input clauses in } d_1 \text{ and } d_2} \qquad (1)$$


Thus, if no clauses are shared between $d_1$ and $d_2$, then i = 1 and if all the clauses
are shared, then i = 0.

In addition, we also know that it is highly unlikely that all the input clauses
are mutually independent. Again, lacking a realistic way to calculate the
dependencies, we give a heuristic estimate h in the range 0...1 to the overall
independence of the input clause set, where 1 stands for total independence and 0 for
total dependence.

Finally, we will calculate the overall independence of two derivations $d_1$ and
$d_2$ as $i \cdot h$. Next, we will postulate a heuristic rule for the combination of these
two independence measures as follows.


Definition 6 (Confidence Calculation for Cumulative Evidence). Given
two derivations $d_1$ and $d_2$ of the search result C with confidences $c_1$ and $c_2$,
calculate the updated confidence of C as


$$\max(c_1 + c_2 \cdot i \cdot h,\; c_1 \cdot i \cdot h + c_2) - c_1 \cdot c_2 \cdot i \cdot h$$


where
- the independence of derivations i is defined as in (1) above,
- h is the heuristic estimate of the independence of the total set of input clauses,
from 1 for total independence to 0 for total dependence.


The formula satisfies the following intuitive requirements for cumulative
evidence:


- If $d_1$ and $d_2$ do not share non-question input clauses and all the input clauses
are mutually independent, $i \cdot h = 1$ and the formula turns into
$c_1 + c_2 - c_1 \cdot c_2$.

- If $d_1$ and $d_2$ have the same non-question input clauses or the total set of
input clauses is mutually totally dependent, $i \cdot h = 0$ and the formula turns
into $\max(c_1, c_2)$.
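
The cumulation rule and its two limiting cases can be checked with the following Python sketch. The function names are hypothetical, and how exactly shared and total input clauses are counted is left to the caller, since the text does not fix a counting convention.

```python
def independence(shared: int, total: int) -> float:
    """Estimate (1): 1 when nothing is shared, 0 when everything is shared."""
    if total <= 0:
        raise ValueError("total number of input clauses must be positive")
    return 1.0 - shared / total


def cumulative_confidence(c1: float, c2: float, i: float, h: float = 1.0) -> float:
    """Definition 6: combine two derivations of the same result."""
    x = i * h  # overall independence of the two derivations
    return max(c1 + c2 * x, c1 * x + c2) - c1 * c2 * x


fully_independent = cumulative_confidence(0.5, 0.4, i=1.0, h=1.0)
fully_dependent = cumulative_confidence(0.5, 0.4, i=0.0, h=1.0)
print(round(fully_independent, 3), round(fully_dependent, 3))
```

With i·h = 1 the result equals c1 + c2 - c1·c2, and with i·h = 0 it equals max(c1, c2); intermediate values of i·h interpolate between the two.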


3.6 Negative Evidence 


Recall the standard mechanism employed in FOL provers for finding concrete 
answers: transforming existentially quantified goal clauses to clauses containing a 
special answer predicate and treating clauses containing only answer predicates 
as actual answers to the question found. 

Once negation is present, the reasoning system using the CONFER frame- 
work has to attempt to find both positive and negative evidence for any potential 
answer. This cannot be easily done in a single proof search run. 

Observe that giving a general search question containing variables like
$bird(X) \vee answer(X)$ may produce a different set of answers than the positive
question $\neg bird(X) \vee answer(X)$. Also observe that the potential set of answers
may be huge for both positive and negative answers: in a large KB there may 
be millions of statements about birds and our reasoning system will be able to 
derive only a small fraction of potential answers in any given time slot. Thus, 
even if negative evidence is potentially derivable for some positive answer, the 
system is unlikely to find it. 

A reasonable solution to this problem is to run the searches for negative
evidence only for the concrete instances of positive answers found. More concretely,
we conduct additional proof search for the negations of two types of questions
Q: (a) If Q contains no existentially quantified variables, is the statement ¬Q
true? (b) For all i vectors of values C_1i, ..., C_ni found for existentially bound
variables X_1, ..., X_n in Q making Q true, is ¬Q true when we substitute the values
C_1i, ..., C_ni for the corresponding variables in Q? The final confidence of an answer
to Q is calculated by subtracting from the confidence of the positive answer the
confidence of the answer to the corresponding negated instance of the question.

Using negative evidence may lead to unexpected results. Consider the fol- 
lowing trivial example in the ProbLog syntax: 


0.5::bird(a). 0.5::not bird(a). query(bird(a)). 


CONFER gives us confidence 0, which we interpret as “no information”, not as
“false”. However, ProbLog2 gives confidence 0.25, which is explained by one of
the authors in private correspondence thus: an atom (head) is satisfied if any of
the rules that make it true fire and none of the rules that make it false fire. In
this example ProbLog2 gets $0.5 \cdot (1 - 0.5) = 0.25$. On the other hand, the three
different algorithms of the Alchemy 2 system (MC-SAT explained in [12], and exact
and approximate probabilistic theorem proving explained in [11]) give answers
0.015, 0 and 0.082, respectively. To be concrete, we are using the Alchemy 2
versions from [2]. For this and the following Alchemy 2 examples we prepared 
an MLN file with no weights and a training data file with some generated facts 
for each example. Then we ran the learnwts program with default parameters, 
which created the MLN file with weights for each example. 

Next, consider the previous example augmented with the “birds fly” rule:


0.5::bird(a). 0.5::not bird(a). 0.9:: flies(X) :- bird(X).
query(flies(_)).


Here CONFER gives us 0.45, which is inconsistent with the result of the previ- 
ous example. ProbLog2, on the other hand, gives 0.225, which is unintuitive, but 
consistent with the unintuitive result of ProbLog2 in the previous example. The 
three algorithms of Alchemy 2 mentioned above give us 0.047, 0 and 0.98. The 
issue arising in this example is similar to nonmonotonic reasoning like default 
logic: adding negative evidence to being a bird should block previously deriv- 
able facts. We know that since FOL is not decidable, such checks would make 
derivation steps generally not computable. As a final twist to the example we 
augment the ruleset by giving more details about the distribution: 


0.5::bird(a). 0.5::not bird(a). 0.9:: flies(X) :- bird(X).
0.1:: not flies(X) :- bird(X).
%% 0.1:: flies(X) :- not bird(X). %% commented out
0.9:: not flies(X) :- not bird(X).
query(flies(_)).


Here CONFER gives us an acceptable 0.014 (positive evidence 0.490 and
negative evidence 0.476), while ProbLog2 gives 0.2025. The results of Alchemy 2
are 0.047, 0 and 0.976. Adding the rule we have commented out makes CONFER
give -0.008, while ProbLog2 complains that the example is not acceptable.
Alchemy 2 gives us 0.056, 0 and 0.509.


4 Implementation and Experimental Results 


The first author has implemented the CONFER framework as an extended 
version of his high-performance open-source automated reasoning system gkc 
[39] for FOL, performing fairly well in the yearly CASC competition for au- 
tomated reasoners [36], see http://www.tptp.org/CASC/. The implementation 
is written in C like gkc. The compiled executable can be downloaded from 
http://logictools.org/confer/ along with a number of examples.

Several algorithms, strategies and optimizations present in the gkc system are 
currently switched off, due to the need for additional modifications and testing.
In particular, parallel processing is switched off, as well as the crucial algorithms
for selecting a list of suitable search strategies and performing search by batches 
with iteratively increasing time limits. 

Importantly, we have not yet implemented any specialized strategies for using 
the attached confidences and dependencies for directing and optimizing search. 
It is clear that the added information gives ample potential opportunities for 
directing the search. 

We will give an overview of the experiments with the implementation in two 
sections. First we will look at the confidences calculated and compare these, 
where possible, with the values given by ProbLog2 and Alchemy 2. Next we will 
look at the performance of the system on nontrivial problems. 

The inputs and outputs for the CONFER implementation and the systems 
compared to are given on the web page http://logictools.org/confer/. The set of 
examples given contains over 30 case studies and can be run using the command- 
line implementation provided on the same web page as a single executable file. 
The implementation is self-contained, not dependent on other systems or exter- 
nal libraries. It should run on any 64-bit Linux system. 


4.1 Comparing Confidences 


We will compare the confidences calculated by CONFER on small selected
examples with those of ProbLog2 and Alchemy 2. The first two examples are presented
in the ProbLog2 tutorial. When CONFER can perform neither cumulation nor
collection of negative evidence, the values calculated are the same as those of ProbLog2.
The cumulation operation of CONFER produces, as expected, slightly different values than
ProbLog2 or Alchemy 2. For the following examples the overall independence
estimate h is assigned 1 (maximum). Since the principles of handling negative
evidence are fundamentally different between the two systems, this operation
causes the most significant changes. It is worth noticing that more often than
not, the results of ProbLog2 and Alchemy 2 also differ.

First, a simple version of the well-known social networks of smokers example 
in the ProbLog syntax. CONFER uses a different syntax, but the clauses and 
confidences given are exactly the same. We have also built the corresponding 
data- and rulesets for Alchemy 2, which uses a fairly different input method 
than CONFER or ProbLog. 


0.8::stress(ann). 0.4::stress(bob).
0.6::influences(ann,bob). 0.2::influences(bob,carl).
smokes(X) :- stress(X).
smokes(X) :- influences(Y,X), smokes(Y).
query(smokes(carl)).


For this example, ProbLog2 gives an answer 0.1376 and CONFER gives
0.1201, cumulating values 0.096 and 0.08. The three different algorithms of
Alchemy 2 (MC-SAT inference, see [12], and exact and approximate lifted inference
explained in [11]) give 0.135, 0 and 0.741, respectively. In the following tables we
will refer to these three as Alch i, Alch e and Alch a. Removing the input clause
0.4::stress(bob) also removes the cumulation possibility and both CONFER
and ProbLog2 give 0.096 as an answer.
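
The numbers quoted for the smokers example can be retraced with a short Python calculation. The two proof confidences and the exact probability follow directly from the input facts, assuming the probabilistic facts are independent. The value i·h = 1/3 used in the last step is purely an assumption of this sketch: it is not stated in the text and is shown only because it happens to land on the reported 0.1201.

```python
# Input confidences taken from the smokers example above.
stress_ann, stress_bob = 0.8, 0.4
infl_ann_bob, infl_bob_carl = 0.6, 0.2

# Two separate proofs of smokes(carl); each multiplies its premises.
via_ann = stress_ann * infl_ann_bob * infl_bob_carl
via_bob = stress_bob * infl_bob_carl

# Exact probability when all four facts are independent events:
# carl smokes iff bob influences carl and bob smokes.
p_smokes_bob = 1 - (1 - stress_bob) * (1 - stress_ann * infl_ann_bob)
exact = infl_bob_carl * p_smokes_bob


def cumulate(c1: float, c2: float, ih: float) -> float:
    """Definition 6 with the product i*h passed as a single number."""
    return max(c1 + c2 * ih, c1 * ih + c2) - c1 * c2 * ih


lower = cumulate(via_ann, via_bob, 0.0)      # proofs treated as fully dependent
upper = cumulate(via_ann, via_bob, 1.0)      # proofs treated as fully independent
assumed = cumulate(via_ann, via_bob, 1 / 3)  # ASSUMED i*h, not from the source

print(round(via_ann, 4), round(via_bob, 4), round(exact, 4))
print(round(lower, 4), round(assumed, 4), round(upper, 4))
```

The cumulated value always lies between max(c1, c2) and c1 + c2 - c1·c2, here between 0.096 and 0.16832, so the reported CONFER value is consistent with a partial overlap of the input clauses of the two proofs.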

Next, the well-known earthquake example. CONFER performs both cumu- 
lation and collecting negative evidence. 


person(john). person(mary).
0.7::burglary. 0.2::earthquake.
0.9::alarm :- burglary, earthquake.
0.8::alarm :- burglary, \+earthquake.
0.1::alarm :- \+burglary, earthquake.
0.8::calls(X) :- alarm, person(X).
0.1::calls(X) :- \+alarm, person(X).
evidence(calls(john),true).
evidence(calls(mary),true).
query(burglary).
query(earthquake).


We will present the ProbLog2 and CONFER results with both the positive 
and negative evidence components (columns CONFER + and CONFER -) given 
by CONFER. Importantly, by default CONFER will try to find up to 10 different 
proofs: increasing or decreasing these limits has a noticeable effect on the results 
as well as running time. 


query      | CONFER | CONFER + | CONFER - | ProbLog | Alch i | Alch e | Alch a
burglary   | 0.8713 | 0.97650  | 0.1051   | 0.9819  | 0.709  | 0      | 0.905095
earthquake | 0.1648 | 0.8854   | 0.7206   | 0.2268  | 0.204  | 0      | 0.888
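
As a quick consistency check of the table, the final CONFER value should equal the positive component minus the negative component, up to rounding of the printed figures. The sketch below uses only the numbers from the table; the helper name final_confidence and the tolerance are assumptions of the sketch.

```python
def final_confidence(positive: float, negative: float) -> float:
    """Final confidence: positive evidence minus negative evidence."""
    return positive - negative


# (CONFER +, CONFER -, reported CONFER) from the earthquake example table.
rows = {
    "burglary": (0.97650, 0.1051, 0.8713),
    "earthquake": (0.8854, 0.7206, 0.1648),
}

# Largest gap between the recomputed and the reported value.
max_gap = max(
    abs(final_confidence(pos, neg) - reported)
    for pos, neg, reported in rows.values()
)
print(max_gap < 5e-4)  # tolerance covers rounding in the printed table
```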


Finally we bring the famous penguin example from default logic. We will 
formulate it using confidences instead of defaults. We state that penguins form 
a tiny subset of birds. The CONFER implementation collects both positive and 
negative evidence, but there are no cumulation possibilities. 


1.0::bird(tweety). 1.0::penguin(pennie).
1.0:: bird(X) :- penguin(X).
0.001:: penguin(X) :- bird(X).
0.9:: flies(X) :- bird(X).
1.0:: not flies(X) :- penguin(X).
query(flies(_)).

query         | CONFER | CONFER + | CONFER - | ProbLog | Alch i  | Alch e | Alch a
flies(pennie) | -0.1   | 0.9      | 1.0      | 0       | 0.00001 | 0      | 0
flies(tweety) | 0.899  | 0.9      | 0.001    | 0.8991  | 0.064   | 0      | 0.873


4.2 Performance


We will investigate the performance of our CONFER implementation on the 
following nontrivial example FOL problems from the TPTP collection [40]. Due 
to restrictions in the language or the principles of the search algorithm, ProbLog2 
cannot handle any of these examples even if they are converted to clauses in 
ProbLog syntax. Thus we will compare the performance of the CONFER system 
on several modifications of the problems against the conventional FOL prover 
gkc used as a base for building the CONFER system. 

The results are given for the following problems with the TPTP identifier
and rating: 0 means all the provers tested by the TPTP maintainers find a
proof, 1 means no prover manages to find a proof. Steamroller (PUZ031+1.p,
rating 0) is a puzzle without equality. Dreadbury (PUZ001+2.p, rating 0.23) is a
puzzle that also uses equality. Lukasiewicz (LCL047-1.p, rating 0) is an example in
logical calculi. Commonsense reasoning problems from CYC are taken from the
largest consistent CYC version in TPTP: CSR025+5, CSR035+5, CSR045+5,
CSR055+5 (ratings 0.67, 0.83, 0.97, 0.87).

The CYC problems CSR025+5 ... CSR055+5 contain about half a million
formulae, but the proofs are relatively short. The first three problems are relatively
small, but their proofs are significantly longer. The Steamroller, Dreadbury and 
the CYC CSR035+5 problems have been augmented with a question asking for 
answer substitutions, while for the other CYC problems and the Lukasiewicz 
problems the conjectures do not contain the existence quantifier, thus we just 
try to prove these. For comparison purposes the CONFER proof searches are 
restricted to finding only the first answer (thus no cumulation is possible) and 
not collecting negative evidence. 

We consider both the versions of problems with all clauses assigned a
confidence between 0.6 and 0.99 cyclically with a step of 0.01 (column CONFER in the
following table) and with all the confidences assigned 1.0 (column CONFER 1.0). It
is important to note that the CONFER system uses conventional subsumption
and simplification for clauses with the confidence 1.0, i.e., in the “CONFER 1.0”
column proof search is reduced to the ordinary resolution search. The “gkc pure”
column gives the pure search time of the gkc prover used as a base for building the
CONFER system, for the original TPTP versions (without a question of
substitutions being asked). As a special case, variations 0 ... 4 of the Lukasiewicz
problem are formed by attaching confidences below 1 to respectively 0 ... 4 input
clauses and letting the other confidences have the value 1.0 (the Lukasiewicz problem
consists of five clauses, one of these being the clause to be proved).
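
One way to read the cyclic assignment of confidences is sketched below in Python. The starting value 0.60, the order in which clauses receive their values and the function name cyclic_confidences are assumptions of the sketch; the text only states the range and the step.

```python
def cyclic_confidences(n_clauses: int, low: float = 0.60,
                       high: float = 0.99, step: float = 0.01) -> list:
    """Assign low, low+step, ..., high, low, ... to n_clauses clauses."""
    # Number of distinct values in one cycle (40 for the defaults).
    cycle = round((high - low) / step) + 1
    # Rounding avoids float noise such as 0.6100000000000001.
    return [round(low + (k % cycle) * step, 2) for k in range(n_clauses)]


confidences = cyclic_confidences(42)
print(confidences[0], confidences[39], confidences[40], confidences[41])
```

With the defaults, the 41st clause wraps around and receives 0.6 again, so a problem with half a million formulae reuses each of the 40 values many times.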

The columns CONFER ... “gkc pure” contain the pure proof search time in
seconds using negative ordered resolution for all the problems except CYC and
the set of support resolution for CYC. Pure
search time does not include printing, parsing and clausifying the problem and
indexing the formed clauses. The final column “gkc full” gives the full wall clock
time for gkc.


Problem     | CONFER | CONFER 1.0 | gkc pure | gkc full
Steamroller | 0.0018 | 0.0015     | 0.001    | 0.06
Dreadbury   | 0.0017 | 0.0011     | 0.001    | 0.06
Lukasz 0    |        | 0.0916     | 0.093    | 0.22
Lukasz 1    | 0.913  |            |          |
Lukasz 2    | 23     |            |          |
Lukasz 3    | 19     |            |          |
Lukasz 4    | 16     |            |          |
CSR025+5    | 0.0004 | 0.0001     | 0.0001   | 4.5
CSR035+5    | 0.0001 | 0.0001     | 0.07     | 4.6
CSR045+5    | 3.418  | 1.4        | 1.3      | 5.8
CSR055+5    | 0.0001 | 0.0001     | 0.0001   | 4.5


We can observe that the confidence and dependency collecting calculations
along with the restricted c-subsumption do not have a noticeable effect on
performance for most of these problems. However, adding confidences below 1 to
the Lukasiewicz problem does incur a significant penalty, which, surprisingly,
diminishes somewhat when all the clauses have such confidences. The
confidences incur a noticeable penalty for CSR045+5, which has the longest proof
among our CYC examples. Our hypothesis is that for these examples
c-subsumption along with restricted simplification changes the direction of the
search significantly.


5 Summary and Future Work 


We have presented a novel framework CONFER along with the implementation 
for reasoning with approximate confidences for full, unrestricted first order logic. 
The presented examples demonstrate that the confidences found by our imple- 
mentation are similar to the confidences found by the leading probabilistic Prolog 
and Markov logic implementations ProbLog2 [18] and Alchemy 2 [1]. CONFER 
is based on conventional first order theorem proving theory and algorithms not 
requiring saturation, differently from the systems using weighted ground satu- 
ration of FOL formulas like ProbLog2 and Alchemy 2. We have shown that this 
enables the CONFER implementation to efficiently solve large nontrivial FOL 
problems with attached confidences. 

We plan to continue work on the CONFER implementation in several di- 
rections: finding and removing bugs, improving the functionality and devising 
search strategies specialized for the FOL formulas with associated confidences. 
We expect to integrate machine learning approaches, in particular using seman- 
tic similarities for reasoning with analogies and estimating the relevance of input 
clauses for proof search guidance. The goal of this work is creating a practically 
usable component for logic-based question answering from large commonsense 
knowledge bases.


Abstract. The state-of-the-art superposition-based theorem provers for first-order
logic rely on simplification orderings on terms to constrain the applicability
of inference rules, which in turn shapes the ensuing search space. The popular 
Knuth-Bendix simplification ordering is parameterized by symbol precedence—a 
permutation of the predicate and function symbols of the input problem’s signa- 
ture. Thus, the choice of precedence has an indirect yet often substantial impact 
on the amount of work required to complete a proof search successfully. 

This paper describes and evaluates a symbol precedence recommender, a machine 
learning system that estimates the best possible precedence based on observations 
of prover performance on a set of problems and random precedences. Using the 
graph convolutional neural network technology, the system does not presuppose 
the problems to be related or share a common signature. When coupled with the 
theorem prover Vampire and evaluated on the TPTP problem library, the recom- 
mender is found to outperform a state-of-the-art heuristic by more than 4% on 
unseen problems. 


Keywords: saturation-based theorem proving - simplification ordering - symbol 
precedence - machine learning - graph convolutional network 


1 Introduction 


Modern saturation-based Automatic Theorem Provers (ATPs) such as E [34], SPASS
[40], or Vampire employ the superposition calculus as their underlying
inference system. Integrating the flavors of resolution [5], paramodulation [30], and the
unfailing completion [3], superposition is a powerful calculus with native support for
equational reasoning. The calculus is parameterized by a simplification ordering on 
terms and uses it to constrain the applicability of inferences, with a significant impact 
on performance. 

Both main classes of simplification orderings used in practice, the Knuth-Bendix 
ordering and the lexicographic path ordering [16], are specified with the help of 
a symbol precedence, an ordering on the signature symbols. While the superposition 
calculus is refutationally complete for any simplification ordering [4], the choice of the 
precedence has a significant impact on how long it takes to solve a given problem. 

It is well known that giving the highest precedence to the predicate symbols in- 
troduced as sub-formula names during clausification can immediately make the 
saturation produce the exponential set of clauses that the transformation is designed to
avoid [29]. Also, certain orderings help to make the superposition a decision procedure
on specific fragments of first-order logic (see, e.g., DIHA). However, the precise way 
by which the choice of a precedence influences the follow-up proof search on a general 
problem is extremely hard to predict. 

Several general-purpose precedence generating schemes are available to ATP users, 
such as the successful invfreq scheme in E [83], which orders the symbols by the 
number of occurrences in the input problem. However, experiments with random prece- 
dences indicate that the existing schemes often fail to come close to the optimum prece- 
dence [28], suggesting room for further improvements. 

In this work, we propose a machine learning system that learns to predict for an 
ATP whether one precedence will lead to a faster proof search on a given problem than 
another. Given a previously unseen problem, it can then be asked to recommend the best 
possible precedence for an ATP to run with. Relying only on the logical structure of the 
problems, the system generalizes the knowledge about favorable precedences across 
problems with different signatures. 

Our recommender uses a relational graph convolutional neural network to rep- 
resent the problem structure. It learns from the ATP performance on selected problems 
and pairs of randomly sampled precedences. This information is used to train a symbol 
cost model, which then realizes the recommendation by simply sorting the problem’s 
symbols according to the obtained costs. 
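
The final recommendation step, turning symbol costs into a precedence, amounts to a sort. The following Python sketch shows this step only; the cost values are made up for illustration, the function name precedence_from_costs is hypothetical, and whether low-cost symbols should come first or last in the precedence is a convention that the sketch does not fix.

```python
from typing import Dict, List


def precedence_from_costs(costs: Dict[str, float]) -> List[str]:
    """Order the signature symbols by ascending predicted cost.

    Ties are broken by symbol name so that the result is reproducible.
    """
    return sorted(costs, key=lambda symbol: (costs[symbol], symbol))


# Hypothetical costs, as a trained cost model might output them.
costs = {"f": 0.7, "g": -1.2, "p": 0.1, "c": 0.1}
precedence = precedence_from_costs(costs)
print(precedence)
```

Because only the relative order of the costs matters, any monotone rescaling of the model output yields the same precedence.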

This work strictly improves on our previous experiments with linear regression 
models and simple hand-crafted symbol features [6] and is, to the best of our
knowledge, the first method able to propose good symbol precedences automatically using a
non-linear transformation of the input problem structure. 

The rest of this paper is organized as follows. Section 2 introduces the basic
terminology used throughout the remaining sections. Section 3 proposes a structure of the
precedence recommender that can be trained on pairs of symbol precedences, as
described in Sect. 4. Section 5 summarizes and discusses experiments performed using an
implementation of the precedence recommender. Section 6 compares the system
proposed in this work with notable related works. Section 7 concludes the investigation
and outlines possible directions for future research.


2 Preliminaries 


2.1 Saturation-Based Theorem Proving 


A first-order logic (FOL) problem consists of a set of axiom formulas and a conjec- 
ture formula. In a refutation-based automated theorem prover (ATP), proving that the 
axioms entail the conjecture is reduced to proving that the axioms together with the 
negated conjecture entail a contradiction. The most popular first-order logic (FOL)
automated theorem provers (ATPs), such as Vampire [21], E [34], or SPASS [40], start the
proof search by converting the input FOL formulas to an equisatisfiable representation 
in clause normal form (CNF) [2503]. We denote the problem in clause normal form 
(CNF) as P = (X, Cl), where X is a list of all non-logical (predicate and function) 
symbols in the problem called the signature, and Cl is the set of clauses of the problem 
(including the negated conjecture).

Given a problem P in CNF, a saturation-based ATP searches for a refutational proof
by iteratively applying the inference rules from the given calculus to infer new clauses
entailed by Cl. As soon as the empty clause, denoted by □, is inferred, the prover
concludes that the premises entail the conjecture. The sequence of those inferences
leading up from the input clauses Cl to the discovered □ constitutes a proof. If the
premises do not entail the conjecture, the proof search continues until the set of inferred 
clauses is saturated with respect to the inference rules. In the standard setting of time- 
restricted proof search, a time limit may end the process prematurely. 

Since the space of derivable clauses is typically very large, the efficacy of the prover 
depends on the order in which the inferences are applied. The standard saturation-based 
ATPs order the inferences by maintaining two classes of inferred clauses: processed and 
unprocessed [34]. In each iteration of the saturation loop, one clause (so-called given 
clause) is combined with all the processed clauses for inferences. The resulting new 
clauses and the given clause are added to the unprocessed set and the processed set, 
respectively. Finishing the proof in few iterations of the saturation loop is important 
because the number of inferred clauses typically grows exponentially during the proof 
search. 
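The given-clause loop described above can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: it uses propositional binary resolution instead of the superposition calculus, selects the shortest unprocessed clause as the given clause, and the names `resolvents` and `saturate` are hypothetical. Literals are non-zero integers, and a negative integer denotes a negated atom.

```python
def resolvents(c1, c2):
    """Return all non-tautological binary resolvents of two propositional clauses."""
    out = set()
    for lit in c1:
        if -lit in c2:
            res = (c1 - {lit}) | (c2 - {-lit})
            if not any(-x in res for x in res):  # drop tautologies
                out.add(frozenset(res))
    return out


def saturate(clauses, max_iterations=1000):
    """Run a toy given-clause loop; return (status, number of iterations)."""
    unprocessed = [frozenset(c) for c in clauses]
    processed = []
    seen = set(unprocessed)
    iterations = 0
    while unprocessed and iterations < max_iterations:
        unprocessed.sort(key=len)        # stand-in clause selection heuristic
        given = unprocessed.pop(0)
        iterations += 1
        if not given:                    # the empty clause was selected
            return "refutation", iterations
        new = set()
        for other in processed + [given]:
            new |= resolvents(given, other)
        processed.append(given)          # the given clause becomes processed
        for clause in new:               # new clauses become unprocessed
            if clause not in seen:
                seen.add(clause)
                unprocessed.append(clause)
    return ("saturated" if not unprocessed else "limit"), iterations


if __name__ == "__main__":
    print(saturate([{1}, {-1, 2}, {-2}]))  # unsatisfiable clause set
    print(saturate([{1, 2}, {-1, 2}]))     # satisfiable clause set
```

The iteration count returned by `saturate` plays the role of the effort metric used later in the paper: fewer iterations of the loop mean a cheaper proof search.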


2.2 Superposition Calculus 


The superposition calculus is of particular interest because it is used in the most successful
contemporary FOL ATPs. A simplification ordering on terms [4] constrains the
inferences of the superposition calculus. 

The simplification ordering on terms influences the superposition calculus in two 
ways. First, the inferences on each clause are limited to the selected literals. In each 
clause, either a negative literal or all maximal literals are selected. The maximality is 
evaluated according to the simplification ordering. Second, the simplification ordering 
orients some of the equalities to prevent superposition and equality factoring from in- 
ferring redundant complex conclusions. In each of these two roles, the simplification 
ordering may impact the direction and, in effect, the length of the proof search. 

The Knuth-Bendix ordering (KBO) [19], a commonly used simplification ordering
scheme, is parameterized by symbol weights and a symbol precedence, a permutation²
of the non-logical symbols of the input problem. In this work, we focus on the task
of finding a symbol precedence that leads to good performance of an ATP when
plugged into KBO, leaving all the symbol weights at the
default value 1 as set by the ATP Vampire.


2.3 Neural Networks 


A feedforward artificial neural network [12] is a directed acyclic graph of modules.
Each module is an operation that consumes a numeric (input) vector and outputs a
numeric vector. Each of the components of the output vector is called a unit of the
module. The output of each module is differentiable with respect to the input almost
everywhere.

2 The definition of KBO does not require the precedence to be total. However, for use in ATPs,
the more symbols and thus also terms we can compare, the better.

The standard modules include the fully connected layer, which performs an affine 
transformation, and non-linear activation functions such as the Rectified Linear Unit 
(ReLU) or sigmoid.³ A fully connected layer with a single unit is called the linear unit.

Some of the modules are parameterized by numeric parameters. For example, the 
fully connected layer that transforms the input x by the affine transformation W x + b is 
parameterized by the weight matrix W and the bias vector b. If the output of a module 
is differentiable with respect to a parameter, that parameter is considered trainable. 

In a typical scenario, the neural network is trained by gradient descent on a training 
set of examples. In such a setting, the network outputs a single numeric value called 
loss when evaluated on a batch of examples. The loss of a batch is typically computed 
as a weighted sum of the losses of the individual examples. Since each of the modules is 
differentiable with respect to its input and trainable parameters, the gradient of the loss 
with respect to all trainable parameters of the neural network can be computed using the 
back-propagation algorithm [12]. The trainable parameters are then updated by taking 
a small step against the gradient—in the direction that is expected to reduce the loss. 
An epoch is a sequence of iterations that updates the trainable parameters using each 
example in the training set exactly once. 

A graph convolutional network (GCN) is a special case of feedforward neural net- 
work. The modules of a GCN transform messages that are passed along the edges of 
a graph encoded in the input example. A particular architecture of a GCN used prominently
in this work is discussed in Sect. 3.2.


3 Architecture 


A symbol precedence recommender is a system that takes a CNF problem P = (X, Cl) 
as the input, and produces a precedence π* over the symbols X as the output. For the
recommender to be useful, it should produce a precedence that likely leads to a quick 
search for a proof. In this work, we use the number of iterations of the saturation loop 
as a metric describing the effort required to find a proof. 

The recommender described in this section first uses a neural network to compute a 
cost value for each symbol of the input problem, and then orders the symbols by their 
costs in a non-increasing order. In this manner, the task of finding good precedences is 
reduced to the task of training a good symbol cost function, as discussed in Sect. 4.

The recommender consists of modules that perform specific sub-tasks, each of 
which is described in detail in one of the following sections (see also Fig. 1).


3.1 Graph Constructor: From CNF to Graphs 


As the first step of the recommender processing pipeline, the input problem is converted
from a CNF representation to a heterogeneous (directed) graph [41]. Each of the nodes
of the graph is labeled with a node type, and each edge is labeled with an edge type,
defining the heterogeneous nature of the graph. Each node corresponds to one of the
elements that constitute the CNF formula, such as a clause, an atom, or a predicate
symbol. Each such category of elements corresponds to one node type. The edges represent
the (oriented) relations between the elements, for example, the incidence relation
between a clause and one of its (literals’) atoms, or the relation between an atom and
its predicate symbol. R denotes the set of all relations in the graph. Figure 2 shows the
types of nodes and edges used in our graph representation. Figure 3 shows an example
of a graph representation of a simple problem.

3 These are, respectively, f(x) = max{0, x} and g(x) = 1/(1 + e^{-x}).

[Figure 1 diagram. Labels: Problem P, Graph constructor, Graph, GCN, Symbol embeddings, Output layer, Symbol costs, Precedence π*, Precedence π, Precedence ρ, Loss function, Loss value.]

Fig. 1. Recommender architecture overview. When recommending a precedence, the input is
problem P and the output is precedence π*. When training, the input is problem P and precedences
π and ρ, and the output is the loss value. The trainable modules and the edges along which
the loss gradient is propagated are emphasized by bold lines.

The graph representation has, in particular, the following properties:


- Lossless: The original problem can be faithfully reconstructed from the corresponding
  graph representation (up to logical equivalence).

- Signature agnostic: Renaming the symbols and variables in the input problem yields
  an isomorphic graph.

- For each relation r ∈ R, its inverse r^{-1} is also present in the graph, typically
  represented by a different edge type.

- The polarity of the literals is expressed by the type of the edge (pos or neg)
  connecting the respective atom to the clause it occurs in.

- For every non-equality atom and term, the order of its arguments is captured by a
  sequence of argument nodes chained by edges [27].

- The two operands of equality are not ordered. This reflects the symmetry of equality.

- Sub-expression sharing [8,26,27]: Identical atoms and terms share a node representation.
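Some of the listed properties can be made concrete with a small Python sketch. It is a simplified illustration rather than the graph constructor used in the experiments: it handles ground atoms and terms only, omits the argument-order nodes and the special treatment of equality, and all class, function, and edge type names except `pos` and `neg` are hypothetical. Atoms and terms are nested tuples of the form `(symbol, arg1, ...)`.

```python
from collections import defaultdict


class Graph:
    def __init__(self):
        self.node_type = []             # node id -> node type
        self.edges = defaultdict(list)  # edge type -> list of (source, destination)
        self.shared = {}                # structural key -> node id

    def add_node(self, node_type, key=None):
        """Create a node, or reuse the node registered under the same key."""
        if key is not None and key in self.shared:
            return self.shared[key]
        node = len(self.node_type)
        self.node_type.append(node_type)
        if key is not None:
            self.shared[key] = node
        return node

    def add_edge(self, edge_type, src, dst):
        """Add an edge together with its inverse under a separate edge type."""
        self.edges[edge_type].append((src, dst))
        self.edges[edge_type + "_inv"].append((dst, src))


def add_expr(g, expr, kind):
    """Add an atom or term; identical sub-expressions share one node."""
    key = (kind, expr)
    if key in g.shared:
        return g.shared[key]
    node = g.add_node(kind, key)
    symbol, *args = expr
    # Symbol names serve only as lookup keys and are not stored in the graph.
    g.add_edge(kind + "_symbol", node, g.add_node("symbol", ("symbol", kind, symbol)))
    for arg in args:
        g.add_edge(kind + "_arg", node, add_expr(g, arg, "term"))
    return node


def build_graph(clauses):
    """clauses: list of clauses, each a list of (is_positive, atom) pairs."""
    g = Graph()
    for clause in clauses:
        clause_node = g.add_node("clause")
        for positive, atom in clause:
            g.add_edge("pos" if positive else "neg", add_expr(g, atom, "atom"), clause_node)
    return g
```

Because symbol names never enter the node or edge labels, renaming the symbols of the input yields an isomorphic graph, which mirrors the signature-agnostic property.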


3.2 GCN: From Graphs to Symbol Embeddings 


For each symbol in the input problem P, we seek to find a vector representation, i.e., an 
embedding, that captures the symbol’s properties that are relevant for correctly ranking 
the symbol in the symbol precedences over P.

Fig. 2. CNF graph schema


The symbol embeddings are output by a relational graph convolutional network 
(R-GCN) [32], which is a stack of graph convolutional layers. Each layer consists of
a collection of differentiable modules—one module per edge type. The computation 
of the GCN starts with assigning each node an initial embedding and then iteratively 
updates the embeddings by passing them through the convolutional layers. 


The initial embedding $h_a^{(0)}$ of a node a is a concatenation of two vectors: a feature
vector specific for that node (typically empty) and a trainable vector shared by all nodes 
of the same type. In our particular implementation, feature vectors are used in nodes 
that correspond to clauses and symbols. Each clause node has a feature vector with 
a one-hot encoding of the role of the clause, which can be either axiom, assumption, 
or negated conjecture [B836]. Each symbol node has a feature vector with two bits of 
data: whether the symbol was introduced into the problem during preprocessing (most 
notably during clausification), and whether the symbol appears in a conjecture clause. 


One pass through the convolutional layer updates the node embeddings by passing
a message along each of the edges. For an edge of type r ∈ R going from source node s
to destination node d at layer l, the message is composed by converting the embedding
of the source node $h_s^{(l)}$ using the module associated with the edge type r. In the simple
case that the module is a fully connected layer with weight matrix $W_r^{(l)}$ and bias vector
$b_r^{(l)}$, the message is $W_r^{(l)} h_s^{(l)} + b_r^{(l)}$. Each message is then divided by the normalization
constant $c_{s,d} = \sqrt{|N_s^r|} \sqrt{|N_d^r|}$ [18], where $N_a^r$ is the set of neighbors of node a under
the relation r.

Once all messages are computed, they are aggregated at the destination nodes to
form new node embeddings. Each node d aggregates all the incoming messages of a
given edge type r by summation, then passes the sum through an activation function
σ such as the ReLU, and finally aggregates the messages across the edge types by
summation, yielding the new embedding $h_d^{(l+1)}$.
The following formula captures the complete update of the embedding of node d by
layer l:

\[
h_d^{(l+1)} = \sum_{r \in R} \sigma\left( \sum_{s \in N_d^r} \frac{1}{c_{s,d}} \left( W_r^{(l)} h_s^{(l)} + b_r^{(l)} \right) \right)
\]

Fig. 3. Graph representation of the CNF formula a = b ∧ f(a, b) ≠ f(b, b)
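The layer update can be sketched in NumPy as follows. This is an illustrative, unoptimized sketch, not the implementation used in the experiments. It assumes that each module is a fully connected layer, that σ is the ReLU, and that the neighbor counts in the normalization constant are the out-degree of the source node and the in-degree of the destination node under the given relation. The function name `rgcn_layer` and its argument layout are hypothetical.

```python
import numpy as np


def rgcn_layer(h, edges, weights, biases):
    """Apply one relational graph convolution.

    h:       array (num_nodes, d_in) of node embeddings at layer l
    edges:   dict mapping relation r to a list of (source, destination) pairs
    weights: dict mapping relation r to an array (d_out, d_in)
    biases:  dict mapping relation r to an array (d_out,)
    """
    num_nodes = h.shape[0]
    d_out = next(iter(weights.values())).shape[0]
    h_new = np.zeros((num_nodes, d_out))
    for r, pairs in edges.items():
        out_deg = np.zeros(num_nodes)
        in_deg = np.zeros(num_nodes)
        for s, d in pairs:
            out_deg[s] += 1
            in_deg[d] += 1
        aggregated = np.zeros((num_nodes, d_out))
        for s, d in pairs:
            c = np.sqrt(out_deg[s]) * np.sqrt(in_deg[d])  # normalization (assumed form)
            aggregated[d] += (weights[r] @ h[s] + biases[r]) / c
        # ReLU is applied per relation; the results are then summed over relations.
        h_new += np.maximum(aggregated, 0.0)
    return h_new
```

Stacking several such layers, each with its own weights, lets the embedding of a symbol node depend on progressively larger neighborhoods of the problem graph.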


3.3 Output Layer: From Symbol Embeddings to Symbol Costs 


The symbol cost of each symbol is computed by passing the symbol’s embedding 
through a linear output unit, which is an affine transformation with no activation func- 
tion. 

It is possible to use a more complex output layer in place of the linear unit, e.g., 
a feedforward network with one or more hidden layers. Our experiments showed no 
significant improvement when a hidden layer was added, likely because the underlying 
GCN learns a sufficiently complex transformation. 

Let θ denote the vector of all parameters of the whole neural network consisting
of the GCN and the output unit. Given an input problem P with signature X =
(s_1, ..., s_n), we denote the cost of symbol s_i predicted by the network as c(i, P; θ).
In the rest of this text, we refer to the predicted cost of s_i simply as c(i) because the
problem P and the parameters θ are fixed in each respective context.


3.4 Sort: From Symbol Costs to Precedence


The symbol precedence heuristics commonly used in the ATPs sort the symbols by 
some numeric syntactic property that is inexpensive to compute, such as the number 
of occurrences in the input problem, or the symbol arity. In our precedence recommender,
we sort the symbols by their costs c produced by the neural network described
in Sects. 3.2 and 3.3. An advantage of this scheme is that sorting is a fast operation.

Moreover, as we show in Sect. 4, it is possible to train the underlying symbol costs
by gradient descent.
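As a minimal illustration of the sorting step, the following Python sketch turns a list of symbol costs into a precedence. The function name `recommend_precedence` is hypothetical, and breaking ties by symbol index is an assumption made here for determinism.

```python
def recommend_precedence(costs):
    """Return symbol indices ordered by cost, highest cost first.

    Python's sort is stable, so symbols with equal costs keep their index order.
    """
    return sorted(range(len(costs)), key=lambda i: -costs[i])


if __name__ == "__main__":
    print(recommend_precedence([0.2, 1.5, -0.3, 1.5]))
```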


4 Training Procedure 


In Sect. 3, we described the structure of a recommender system that generates a symbol
precedence for an arbitrary input problem. The efficacy of the recommender depends 
on the quality of the underlying symbol cost function c. In theory, the symbol cost 
function can assign the costs so that sorting the symbols by their costs yields an opti- 
mum precedence. This is because, at least in principle, all the information necessary to 
determine the optimum precedence is present in the graph representation of the input 
problem thanks to the lossless property of the graph encoding. Our approach to defining 
an appropriate symbol cost function is based on statistical learning from executions of 
an ATP on a set of problems with random precedences. 

To train a useful symbol cost function c, we define a precedence cost function C 
using the symbol cost function c in a manner that ensures that minimizing C corre- 
sponds to sorting the symbols by c. Finding a precedence that minimizes C can then be 
done efficiently and precisely. We proceed to train C on the proxy task of ranking the 
precedences. 


4.1 Precedence Cost 


We extend the notion of cost from symbols to precedences by taking the sum of the
symbol costs weighted by their positions in the given precedence π:

\[
C(\pi) = Z_n \sum_{i=1}^{n} i \cdot c(\pi(i))
\]

$Z_n = \frac{2}{n(n+1)}$ is a normalization factor that ensures the commensurability of precedence
costs across signature sizes. More precisely, normalizing by $Z_n$ makes the expected
value of the precedence cost on a given problem independent of the problem’s
signature size n, provided the expected symbol cost $\mathbb{E}_i[c(i)]$ does not depend on n:

\[
\mathbb{E}_\pi[C(\pi)] = \mathbb{E}_\pi\left[ Z_n \sum_{i=1}^{n} i \cdot c(\pi(i)) \right] = Z_n \sum_{i=1}^{n} i \cdot \mathbb{E}_\pi[c(\pi(i))] = Z_n \frac{n(n+1)}{2} \mathbb{E}_i[c(i)] = \mathbb{E}_i[c(i)]
\]

When C is defined in this way, the precedence produced by the recommender (see
Sect. 3.4) minimizes C.
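The effect of the normalization can be checked by brute force on a small signature. The sketch below takes Z_n = 2/(n(n+1)), enumerates all precedences of a short list of example costs (the values are arbitrary illustration data), and uses exact rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def precedence_cost(precedence, costs):
    """C(pi) = Z_n * sum_i i * c(pi(i)), with positions i counted from 1."""
    n = len(precedence)
    z_n = Fraction(2, n * (n + 1))
    return z_n * sum(i * costs[s] for i, s in enumerate(precedence, start=1))


def mean_precedence_cost(costs):
    """Average C over all precedences, i.e. under a uniformly random precedence."""
    perms = list(permutations(range(len(costs))))
    return sum(precedence_cost(p, costs) for p in perms) / len(perms)


if __name__ == "__main__":
    costs = [3, -1, 4, 1]
    print(mean_precedence_cost(costs), Fraction(sum(costs), len(costs)))
```

For any cost vector, the two printed values coincide: the mean precedence cost equals the mean symbol cost regardless of the signature size.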


Lemma 1. The precedence cost C is minimized by any precedence that sorts the symbols
by their costs in non-increasing order:

\[
\operatorname{argmin}_{\rho} C(\rho) = \operatorname{argsort}^{-}(c(1), \ldots, c(n))
\]

where $\operatorname{argmin}_{\rho} C(\rho)$ is the set of all precedences that minimize precedence cost C for
a given symbol cost c, and $\operatorname{argsort}^{-}(x)$ is the set of all permutations π that sort vector
x in non-increasing order ($x_{\pi(1)} \geq x_{\pi(2)} \geq \ldots \geq x_{\pi(n)}$).


Proof. We prove direction “$\operatorname{argmin}_{\rho} C(\rho) \subseteq \operatorname{argsort}^{-}(c(1), \ldots, c(n))$” by contradiction.
Let π minimize C and let π not sort the costs in non-increasing order. Then there
exist k < l such that c(π(k)) < c(π(l)). Let π' be a precedence obtained from π by
swapping the elements k and l. Then we obtain

\[
\begin{aligned}
\frac{C(\pi') - C(\pi)}{Z_n} &= k\,c(\pi'(k)) + l\,c(\pi'(l)) - k\,c(\pi(k)) - l\,c(\pi(l)) \\
&= k\,c(\pi(l)) + l\,c(\pi(k)) - k\,c(\pi(k)) - l\,c(\pi(l)) \\
&= k\,(c(\pi(l)) - c(\pi(k))) - l\,(c(\pi(l)) - c(\pi(k))) \\
&= (k - l)(c(\pi(l)) - c(\pi(k))) \\
&< 0
\end{aligned}
\]

The final inequality is due to k − l < 0 and c(π(l)) − c(π(k)) > 0. Clearly, $Z_n > 0$ for
any n > 0. Thus, C(π') < C(π), which contradicts the assumption that π minimizes C.

To prove the other direction of the equality, first observe that all precedences π
that sort the symbol costs in a non-increasing order necessarily have the same precedence
cost C(π). Since $\emptyset \neq \operatorname{argmin}_{\rho} C(\rho) \subseteq \operatorname{argsort}^{-}(c(1), \ldots, c(n))$, each of
the precedences in $\operatorname{argsort}^{-}(c(1), \ldots, c(n))$ has the cost $\min_{\rho} C(\rho)$. It follows that
$\operatorname{argsort}^{-}(c(1), \ldots, c(n)) \subseteq \operatorname{argmin}_{\rho} C(\rho)$.
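Lemma 1 can also be confirmed empirically on small signatures. The following sketch enumerates all precedences of a short cost vector (arbitrary integer example values, so that comparisons are exact) and compares the set of cost minimizers with the set of permutations that sort the costs in non-increasing order. The positive factor Z_n is omitted because it does not change the minimizers. The function names are hypothetical.

```python
from itertools import permutations


def cost(precedence, costs):
    """Unnormalized precedence cost: sum_i i * c(pi(i)), positions from 1."""
    return sum(i * costs[s] for i, s in enumerate(precedence, start=1))


def minimizers(costs):
    """All precedences that attain the minimum precedence cost."""
    perms = list(permutations(range(len(costs))))
    best = min(cost(p, costs) for p in perms)
    return {p for p in perms if cost(p, costs) == best}


def non_increasing_sorts(costs):
    """All permutations that order the costs in non-increasing order."""
    return {
        p
        for p in permutations(range(len(costs)))
        if all(costs[p[i]] >= costs[p[i + 1]] for i in range(len(p) - 1))
    }


if __name__ == "__main__":
    costs = [2, 5, 2, -1]
    print(sorted(minimizers(costs)))
    print(sorted(non_increasing_sorts(costs)))
```

With a tie between two symbols, both sets contain exactly the two precedences that differ in the order of the tied symbols.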


4.2 Learning to Rank Precedences 


Our ultimate goal is to train the precedence cost function C so that it is minimized by 
the best precedence, measuring the quality of a precedence by the number of iterations 
of the saturation loop taken to solve the problem. 

Approaching this task directly, as a regression problem, runs into the difficulty of 
establishing sensible target cost values for the precedences in the training dataset, es- 
pecially when a wide variety of input problems is covered. Approaching the task as 
a binary classification of precedences seems possible, but it is not clear which prece- 
dences should be a priori labeled as positive and which as negative, to give a guarantee 
that a precedence minimizing the precedence cost (i.e. the one obtained by sorting) 
would be among the best in any good sense. 

We cast the task as an instance of a score-based ranking problem by training
a classifier to decide which of a pair of precedences is better based on their costs. We
train the classifier in a way that ensures that better precedences are assigned lower costs.
The motivation for learning to order pairs of precedences is that it allows learning on 
easy problems, and that it may allow the system to generalize to precedences that are 
better than any of those seen during training. 


Training Data. Each training example has the form (P, π, ρ), where P = (X, Cl) is
a problem and π, ρ are precedences over X such that the prover using π solves P in
fewer iterations of the saturation loop than with ρ, denoted as π <_P ρ.


Loss Function. Let (P, π, ρ) be a training example (π <_P ρ). The precedence cost
classifies this example correctly if C(π) < C(ρ), or alternatively S(π, ρ) = C(ρ) −
C(π) > 0. We approach this problem as an instance of binary classification with the
logistic loss [23], a loss function routinely used in classification tasks in machine learning:

\[
\begin{aligned}
L(P, \pi, \rho) &= -\log \operatorname{sigmoid} S(\pi, \rho) = -\log \operatorname{sigmoid}\left( C(\rho) - C(\pi) \right) \\
&= -\log \operatorname{sigmoid}\left( Z_n \sum_{i=1}^{n} i \left( c(\rho(i)) - c(\pi(i)) \right) \right)
\end{aligned}
\]


Note that the classifier cannot simply train S to output a positive number on all pairs 
of precedences because S is defined as a difference of two precedence costs. Intuitively, 
by training on the example (P, π, ρ) we are pushing C(π) down and C(ρ) up.

The loss function is clearly differentiable with respect to the symbol costs, and the 
symbol cost function c is differentiable with respect to its trainable parameters. This 
enables the use of gradient descent to find the values of the parameters of c that locally 
minimize the loss value. 
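The loss and its gradient with respect to the symbol costs can be written out directly. The NumPy sketch below is only an illustration: it treats the symbol costs as free parameters instead of outputs of the GCN, takes Z_n = 2/(n(n+1)), and uses hypothetical function names. Precedences are integer arrays whose entry at position i − 1 is the index of the symbol at position i.

```python
import numpy as np


def precedence_cost(precedence, costs):
    """C(pi) = Z_n * sum_i i * c(pi(i))."""
    n = len(precedence)
    positions = np.arange(1, n + 1)
    return 2.0 / (n * (n + 1)) * np.sum(positions * costs[precedence])


def pair_loss(costs, better, worse):
    """Logistic loss -log sigmoid(S) with S = C(worse) - C(better)."""
    s = precedence_cost(worse, costs) - precedence_cost(better, costs)
    return np.log1p(np.exp(-s))


def pair_loss_grad(costs, better, worse):
    """Gradient of the pair loss with respect to the vector of symbol costs."""
    n = len(costs)
    z_n = 2.0 / (n * (n + 1))
    s = precedence_cost(worse, costs) - precedence_cost(better, costs)
    pos_better = np.empty(n)
    pos_better[better] = np.arange(1, n + 1)  # position of each symbol in `better`
    pos_worse = np.empty(n)
    pos_worse[worse] = np.arange(1, n + 1)    # position of each symbol in `worse`
    ds_dc = z_n * (pos_worse - pos_better)
    return -ds_dc / (1.0 + np.exp(s))         # dL/dS = -1 / (1 + e^S)
```

A gradient step raises the costs of symbols that occur earlier in the better precedence than in the worse one, and lowers the costs of symbols for which the opposite holds, which pushes the cost of the better precedence down and that of the worse precedence up.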

Figure 1 shows how the loss function is plugged into the recommender for training.


5 Experimental Evaluation 


To demonstrate the capacity of the trainable precedence recommender described in 
Sects. 3 and 4, we performed a series of experiments. In this section, we describe the design
and configuration of the experiments, and then compare the performance of several
trained models to a baseline heuristic. 

The scripts that were used to generate the training data and to train and evaluate the 
recommender are available online.⁵


5.1 Environment 


System. All experiments were run on a computer with the CPU Intel Xeon Gold 6140 
(72 cores @ 2.30 GHz) and 383 GiB RAM. 


5 https://github.com/filipbartek/vampire-ml/tree/cade28

Solver. The empirical evaluation was performed using a modified version of the ATP
Vampire 4.3.0 [21]. The prover was used to generate the training data and to evaluate
the trained precedence recommender. To generate the training data, Vampire was
ate the trained precedence recommender. To generate the training data, Vampire was 
modified to output CNF representations of the problems and annotated problem signa- 
tures in a machine-readable format. For the evaluation of the precedences generated by 
the recommender, Vampire was modified to allow the user to supply explicit predicate 
and function symbol precedences for the proof search (normally, the user only picks a 
precedence generation heuristic). The modified version of Vampire is available online.⁶

We ran Vampire with a fixed strategy⁷ and a time limit of 10 seconds. To increase the
potential impact of predicate precedences, we used a simple transfinite Knuth-Bendix 
ordering (TKBO) that compares atoms according to the predicate precedence 
first, using the regular KBO to break ties between atoms and to compare terms (using 
the Vampire option --literal_comparison_mode predicate). 


5.2 Dataset Preparation 


The training data consists of examples of the form (P, π, ρ), where P is a CNF problem
and π, ρ are precedences of symbols of problem P such that out of the two precedences,
π yields a proof in fewer iterations of the saturation loop (see Sect. 2.1).

Since the TKBO never compares a predicate symbol with a function symbol, two 
separate precedences can be considered for each problem: a predicate precedence and a 
function precedence. We trained a predicate precedence recommender separately from 
a function precedence recommender to simplify the training process and to isolate the 
effects of the predicate and function precedences. This section describes how the train- 
ing data for the case of training a predicate precedence recommender was generated. 
Data for training the function precedence recommender was generated analogously. 


Base Problem Set. The input problems were assumed to be specified in the CNF or the 
first-order form (FOF) fragment of the TPTP language [36]. FOF problems were first
converted into equisatisfiable CNF problems by Vampire. 

We used the problem library TPTP v7.4.0 as the source of problems for training 
and evaluation of the recommender. We denote the set of all problems available for 
training and evaluation as P_0 (|P_0| = 17 053).


Node Feature Extraction. In addition to the signature and the structure of the prob- 
lem, some metadata was extracted from the input problem to allow training a more 
efficient recommender. First, each clause was annotated with its role in the problem, 
which could be either axiom, assumption, or negated conjecture. Second, each sym- 
bol was annotated with two bits of data: whether the symbol was introduced into the 
problem during preprocessing, and whether the symbol appeared in a conjecture clause. 
This metadata was used to construct the initial embeddings of the respective nodes in 
the graph representation of the problem (see Sect. 3.2).


6 https://github.com/filipbartek/vampire/tree/cade28

7 Saturation algorithm: DISCOUNT, age to weight ratio: 1:10, AVATAR: disabled, literal
comparison mode: predicate; all other options left at their default values.

Examples Generation. The examples were generated by an iterative sampling of P_0.
In each iteration, a problem P ∈ P_0 was chosen and Vampire was executed twice on
P with two (uniformly) random predicate precedences and one common random func- 
tion precedence. The “background” random function precedence served as additional 
noise (in addition to the variability contained in TPTP) and made sure that the predicate 
precedence recommender would not be able to rely on any specificity that would come 
from fixing function precedences in the training data. 

The two executions were compared in terms of performance: the predicate precedence
π was recognized as better than the predicate precedence ρ, denoted as π <_P ρ,
if the proof search finished successfully with π and if the number of iterations of the
saturation loop with π was smaller than with ρ. If one of the two precedences was
recognized as better, the example (P, π, ρ) would be produced, where π was the better
precedence, and ρ was the other precedence. Otherwise, for example, if the proof search
timed out on both precedences, we would go back to sampling another problem.

To ensure the efficiency of the sampling, we interpreted the process as an instance 
of the Bernoulli multi-armed bandit problem [37], with the reward of a trial being 1 in 
case an example is produced, and 0 otherwise. 

We employed adaptive sampling to balance exploring problems that have been tried 
relatively scarcely and exploiting problems that have yielded examples relatively often. 
For each problem P ∈ P_0, the generator kept track of the number of times the problem
had been tried n_P, and the number of examples generated from that problem s_P. The ratio
s_P / n_P corresponded to the average reward of problem P observed so far. The problems
were sampled using the allocation strategy UCB1 [1] with a parallelizing relaxation.

First, the values of n_P and s_P for each problem P were bootstrapped by sampling
the problem a number of times equal to a lower bound on the final value of n_P (at least
1).⁸ In each subsequent iteration, the generator sampled the problem P that maximized

\[
\frac{s_P}{n_P} + \sqrt{\frac{2 \log n}{n_P}},
\]

where $n = \sum_{P \in P_0} n_P$ was the total number of tries on all problems.
The parallelizing relaxation means that the s_P values were only updated once in 1000
iterations, allowing up to 2000 parallel solver executions.
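The selection rule can be sketched in a few lines of Python. The sketch implements the plain UCB1 index (average reward plus the exploration bonus with the natural logarithm) and omits the parallelizing relaxation and the bootstrapping phase. The function name `ucb1_pick`, the dictionary-based bookkeeping, and the example counts are assumptions made for illustration.

```python
import math


def ucb1_pick(tries, successes):
    """Pick the problem with the highest UCB1 index.

    tries:     dict mapping problem name to the number of tries n_P (at least 1)
    successes: dict mapping problem name to the number of examples s_P
    """
    total = sum(tries.values())

    def index(problem):
        average_reward = successes[problem] / tries[problem]
        bonus = math.sqrt(2.0 * math.log(total) / tries[problem])
        return average_reward + bonus

    return max(tries, key=index)


if __name__ == "__main__":
    # A rarely tried problem wins through its large exploration bonus.
    print(ucb1_pick({"A": 10, "B": 10, "C": 1}, {"A": 5, "B": 0, "C": 0}))
    # With equal numbers of tries, the most rewarding problem wins.
    print(ucb1_pick({"A": 10, "B": 10, "C": 10}, {"A": 5, "B": 0, "C": 0}))
```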

The sampling continued until 1000000 examples were generated when training 
a predicate precedence recommender, or 800 000 examples in the case of a function 
precedence recommender. For example, while generating 1000000 examples for the 
predicate precedence dataset, 5349 out of the 17053 problems yielded at least one ex- 
ample, while the least explored problem was tried 19 times, and the most exploited 
problem 504 times. 


Validation Split. The 17 053 problems in P_0 were first split roughly in half to form the
training set and the validation set. Next, both training and validation sets were restricted
to problems whose graph representation consisted of at most 100 000 nodes to limit the
memory requirements of the training. Approximately 90 % of the problems fit into this
limit and there were 7648 problems in the resulting validation set P_val. The training
set P_train was further restricted to problems that correspond to at least one training example,
resulting in 2571 problems when training a predicate precedence recommender,
and 1953 problems when training a function precedence recommender.

8 The number of tries each problem was bootstrapped with is
$n_0 = \left\lceil \frac{2 \log N}{\left( 1 + \sqrt{2 \log N \, |P_0| / N} \right)^2} \right\rceil$, where
N is the final number of examples to be generated. For example, if N = 1 000 000 and
|P_0| = 17 053, then n_0 = 10.


5.3 Hyperparameters 


We used a GCN described in Sect. 3.2 with depth 4, message size 16, ReLU activation
function, skip connections [41], and layer normalization [2]. We tuned the hyperparam- 
eters by a small manual exploration. 


5.4 Training Procedure 


A symbol cost model was trained by gradient descent on the precedence ranking task
(see Sect. 4.2) using the examples generated from P_train. To avoid redundant computations,
all examples generated from any given problem were processed in the same
training batch. Thus, each training batch contained up to 128 problems and all examples
generated from these problems. The symbol cost model was trained using the Adam optimizer
[17]. The learning rate started at 1.28 × 10^{-3} and was halved each time the loss
on P_train stagnated for 10 consecutive epochs.

The examples were weighted. Each of the examples of problem P contributed to the
training with the weight 1/s_P, where s_P was the number of examples of problem P in
the training set. This ensured that each problem contributed to the training to the same
degree irrespective of the relative number of examples.
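The weighting scheme amounts to the following short Python sketch, in which each example is identified only by the problem it was generated from. The function name `example_weights` and the problem labels are hypothetical.

```python
from collections import Counter


def example_weights(example_problems):
    """Weight each example by 1 / s_P, where s_P counts the examples of its problem."""
    counts = Counter(example_problems)
    return [1.0 / counts[problem] for problem in example_problems]


if __name__ == "__main__":
    print(example_weights(["P1", "P1", "P2", "P1"]))
```

The weights of all examples of one problem sum to 1, so a problem with many examples does not dominate the weighted batch loss.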

We continued the training until the validation accuracy stopped increasing for 100 
consecutive epochs. 


5.5 Final Evaluation 


After the training finished, we performed a final evaluation of the most promising intermediate
trained model on the whole P_val. The model that manifested the best solver
performance on a sample of 1000 validation problems was taken as the most promising.


5.6 Results 


A predicate precedence recommender was trained on approximately 500000 exam- 
ples, and a function precedence recommender was trained on approximately 400 000 
examples. For each problem P ∈ P_val, a predicate precedence and a function precedence were
generated by the respective trained recommenders, and Vampire was run using these
precedences with a wall clock time limit of 10 seconds. The results are averaged over 5
runs to reduce the effect of noise due to the wall clock time limit. As a baseline, the performance
of Vampire with the frequency precedence heuristic⁹ was evaluated with
the same time limit. For comparison, the two trained recommenders were evaluated separately,
with the predicate precedence recommender using the frequency heuristic
to generate the function precedences, and vice versa.


9 This is Vampire’s analogue of the invfreq scheme in E [33].

To generate a precedence for a problem, the recommender first converts the problem
to a machine-friendly CNF format, then converts the CNF to a graph, then predicts
symbol costs using the GCN model and finally orders the symbols by their costs to produce
the precedence. To simplify the experiment, the time limit of 10 seconds was only
imposed on the Vampire run, excluding the time taken by the recommender to generate
the precedence. When run with 2 threads, the preprocessing of a single problem took
at most 1.26 seconds for 80 % of the problems by extrapolation from a sample of 1000
problems.¹⁰ Table 1 shows the results of the final evaluation.


Table 1. Results of the evaluation of symbol precedence heuristics based on various symbol cost 
models on P_val (|P_val| = 7648). Means and standard deviations over 5 runs are reported. The
GCN models were trained according to the description in Sects. 3 to 5. The model Simple is
the final linear model from our previous work [6]. The models that used machine learning only 
for the predicate precedence used the frequency heuristic for the function precedence, and 
vice versa. The frequency model uses the standard frequency heuristic for both predicate and 
function precedence. 


Symbol cost model              Successes on P_val    Improvement over baseline
                               Mean      Std         Absolute   Relative
GCN (predicate and function)   3951.6    1.62        +182.0     1.048
GCN (predicate only)           3923.6    2.24        +154.0     1.041
GCN (function only)            3874.2    1.83        +104.6     1.028
Simple (predicate only)        3827.2    1.94        +57.6      1.015
Frequency (baseline)           3769.6    3.07        0.0        1.000
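The improvement columns follow from the mean success counts by simple arithmetic, as the following Python sketch shows. The mean values are copied from Table 1; the dictionary and function names are hypothetical.

```python
MEAN_SUCCESSES = {
    "GCN (predicate and function)": 3951.6,
    "GCN (predicate only)": 3923.6,
    "GCN (function only)": 3874.2,
    "Simple (predicate only)": 3827.2,
    "Frequency (baseline)": 3769.6,
}


def improvement(model, baseline="Frequency (baseline)"):
    """Return the (absolute, relative) improvement of a model over the baseline."""
    mean = MEAN_SUCCESSES[model]
    base = MEAN_SUCCESSES[baseline]
    return round(mean - base, 1), round(mean / base, 3)


if __name__ == "__main__":
    for name in MEAN_SUCCESSES:
        print(name, improvement(name))
```

The relative improvement of 1.048 for the combined GCN recommender corresponds to the gain of approximately 4.8 % over the frequency heuristic.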


The results show that the GCN-based model outperformed the frequency heuristic
by a significant margin. Since the predicate precedence recommender was trained
with randomly distributed function precedences, it was expected to perform well irrespective
of the function precedence heuristic it is combined with, and conversely. Combining
the trained recommenders for predicate and function precedences yielded
better performance than either of the two in combination with the standard frequency
heuristic, outperforming the frequency heuristic by approximately 4.8 %.

We have confirmed our earlier conjecture [6] that using a graph neural network
(GNN) may outperform the “simple” linear predicate precedence heuristic trained in
[6].¹¹


6 Related Work 


Our previous text [6] marked the initial investigation of applying techniques of machine 
learning to generating good symbol precedences. The neural recommender presented 
here uses a GNN to model symbol costs, while [6] used a linear combination of symbol 
features readily available in the ATP Vampire. The GNN-based approach yields more 
performant precedences at the cost of longer training and preprocessing time. 


10 The remaining 20 % of the problems either finished preprocessing within 5 seconds, or were 
omitted from preprocessing due to exceeding the node count limit. 

11 The measurements presented in Table 1 are not directly comparable with those reported in [6] 
due to differences in the validation problem sets and the computation environments. 

In [26] and [27], the authors propose similar GNN architectures to solve tasks 
on FOL problems. They use the GNNs to solve classification tasks such as premise 
selection. While our system is trained on a proxy classification task, the main task it is 
evaluated on is the generation of useful precedences. 

The problem of learning to rank objects represented by scores trainable by gradient 
descent was explored in [7]. Our work can be seen to apply the approach of [7] to rank 
permutations represented by weighted sums of symbol costs. 


7 Conclusion and Future Work 


We have described a system that extracts useful symbol precedences from the graph 
representations of CNF problems. Comparison with a conventional symbol precedence 
heuristic shows that using a GCN to consider the whole structure of the input problem 
is beneficial. 

A manual analysis of the trained recommender could produce new insights into 
how the choice of the symbol precedence influences the proof search, which could in 
turn help design new efficient precedence generating schemes. Indeed, a trained cost 
model summarizes the observed behaviors of an ATP with random precedences and 
is able to discover patterns in them (as we know implicitly from its accuracy) despite 
their seemingly chaotic behavior as perceived by a human observer. The challenge is to 
extract these patterns in a human-understandable form. 

In addition to the symbol precedence, KBO is determined by symbol weights. In this 
work, we keep the symbol weights fixed to the value 1. Learning to recommend sym- 
bol weights in addition to the precedences represents an interesting avenue for future 
research. 

The same applies to the idea of learning to recommend both the predicate and func- 
tion precedences using a single GCN. The joint learning, although more complex to 
design, could additionally discover interdependencies between the effects of function 
precedence and predicate precedence on the proof search, while the current setup im- 
plicitly assumes that the effects are independent. Finally, a higher training data effi- 
ciency could be achieved by considering all pairs of measured executions on a problem 
in one training batch. 

Abstract. We re-examine the topic of machine-learned clause selection 
guidance in saturation-based theorem provers. The central idea, recently 
popularized by the ENIGMA system, is to learn a classifier for recogniz- 
ing clauses that appeared in previously discovered proofs. In subsequent 
runs, clauses classified positively are prioritized for selection. We pro- 
pose several improvements to this approach and experimentally confirm 
their viability. For the demonstration, we use a recursive neural network 
to classify clauses based on their derivation history and the presence 
or absence of automatically supplied theory axioms therein. The auto- 
matic theorem prover Vampire guided by the network achieves a 41% 
improvement on a relevant subset of SMT-LIB in a real time evaluation. 


Keywords: Saturation-based theorem proving · Clause Selection · Machine 
Learning · Recursive Neural Networks. 


1 Introduction 


The idea to improve the performance of saturation-based automatic theorem 
provers (ATPs) with the help of machine learning (ML), while going back at 
least to the early work of Schulz [8,30], has recently been enjoying a renewed 
interest. Most notable is the ENIGMA system [16,17] extending the ATP E [31] 
by machine learned clause selection guidance. The architecture trains a binary 
classifier for recognizing as positive those clauses that appeared in previously 
discovered proofs and as negative the remaining selected ones. In subsequent 
runs, clauses classified positively are prioritized for selection. 

A system such as ENIGMA needs to carefully balance the expressive power 
of the used ML model with the time it takes to evaluate its advice. For example, 
Loos et al. [22], who were the first to integrate state-of-the-art neural networks 
with E, discovered their models to be too slow to simply replace the traditional 
clause selection mechanism. In the meantime, the data-hungry deep learning ap- 
proaches motivate researchers to augment training data with artificially crafted 
theorems [1]. Yet another interesting aspect is what features we allow the model 
to learn from. One could speculate that the recent success of ENIGMA on the 
Mizar dataset [7,18] can at least partially be explained by the involved prob- 
lems sharing a common source and encoding. It is still open whether some new 
form of general “theorem proving knowledge” could be learned to improve the 
performance of an ATP across, e.g., the very diverse TPTP library. 


© The Author(s) 2021 

In this paper, we propose several improvements to ENIGMA-style clause 
selection guidance and experimentally test their viability in a novel setting: 


— We lay out a set of possibilities for integrating the learned advice into the 
ATP and single out the recently developed layered clause selection [10, 11,36] 
as particularly suitable for the task. 

— We speed up evaluation by a new lazy evaluation scheme under which many 
generated clauses need not be evaluated by the potentially slow classifier. 

— We demonstrate the importance of “positive bias”, i.e., of tuning the classifier 
to rather err on the side of false positives than on the side of false negatives. 

— Finally, we propose the use of “negative mining” for improving learning from 
proofs obtained while relying on previously learned guidance. 


To test these ideas, we designed a recursive neural network to classify clauses 
based solely on their derivation history and the presence or absence of automati- 
cally supplied theory axioms therein. This allows us to test here, as a byproduct 
of the conducted experiments, whether the human-engineered heuristic for con- 
trolling the amount of theory reasoning presented in our previous work [11] can 
be matched or even overcome by the automatically discovered neural guidance. 
The rest of the paper is structured as follows. Sect. 2 recalls the necessary 
ATP theory, explains clause selection and how to improve it using ML. Sect. 3 
covers layered clause selection and the new lazy evaluation scheme. In Sect. 4, we 
describe our neural architecture and in Sect. 5 we bring everything together and 
evaluate the presented ideas, using the prover Vampire as our workhorse and a 
relevant subset of SMT-LIB as the testing grounds. Finally, Sect. 6 concludes. 


2 ATPs, Clause Selection, and Machine Learning 


The technology behind the modern automatic theorem provers (ATPs) for first- 
order logic (FOL), such as E [31], SPASS [40], or Vampire [21], can be roughly 
outlined by using the following three adjectives. 


Refutational: The task of the prover is to check whether a given conjecture G 
logically follows from given axioms $A_1, \ldots, A_n$, i.e. whether 


$A_1, \ldots, A_n \models G$, (1) 


where G and each $A_i$ are FOL formulas. The prover starts by negating the 
conjecture G and transforming $\neg G, A_1, \ldots, A_n$ into an equisatisfiable set of clauses 
C. It then applies a sound logical calculus to iteratively derive further clauses, 
logical consequences of C, until the obvious contradiction in the form of the empty 
clause $\bot$ is derived. This refutes the assumption that $\neg G, A_1, \ldots, A_n$ could be 
satisfiable and thus confirms (1). 


Superposition-based: The most popular calculus used in this context is super- 
position [3,23], an extension of ordered resolution [4] with a built-in support for 
handling equality. It consists of several inference rules, such as the resolution 
rule, factoring, subsumption, superposition, or demodulation. 

Inference rules in general determine how to derive new clauses from old ones, 
where by old clauses we mean either the initial clauses C or clauses derived pre- 
viously. The clauses that need to be present for a rule to be applicable are called 
the premises and the newly derived clause is called the conclusion. By applying 
the inference rules the prover gradually constructs a derivation, a directed acyclic 
(hyper-)graph (DAG), with the initial clauses forming the leaves and the derived 
clauses (labeled by the respective applied rules) forming the internal nodes. A 
proof is the smallest sub-DAG of a derivation containing the final empty clause 
and for every derived clause the corresponding inference and its premises. 


Saturation-based: A saturation algorithm is the concrete way of organizing the 
process of deriving new clauses, such that every applicable inference is eventually 
considered. Modern saturation-based ATPs employ some variant of the given- 
clause algorithm, in which clauses are selected for inferences one by one [27]. 

The process employs two sets of clauses, often called the active set A and the 
passive set P. At the beginning all the initial clauses are put to the passive set. 
Then in every iteration, the prover selects and removes a clause C from P, inserts 
it into A, and performs all the applicable inferences with premises in A such that 
at least one of the premises is C. The conclusions of these inferences are then 
inserted into P. This way the prover maintains (at the end of each iteration) the 
invariant that inferences among the clauses in the active set have been performed. 
The selected clause C is sometimes also called the “given clause”. 
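
The loop described above can be sketched on a toy propositional example. This is only an illustration under simplifying assumptions: clauses are sets of nonzero integers, binary resolution is the only inference rule, selection is by size with age as a tie-breaker, and the empty clause is detected when it is selected rather than when it is derived. None of the names below come from an actual prover.

```python
import heapq
from itertools import count

def resolve(c1, c2):
    """All non-tautological propositional resolvents of two clauses."""
    resolvents = []
    for lit in c1:
        if -lit in c2:
            res = (c1 - {lit}) | (c2 - {-lit})
            if not any(-l in res for l in res):
                resolvents.append(frozenset(res))
    return resolvents

def given_clause_loop(initial_clauses, max_iterations=10_000):
    """Return True if the empty clause is found, False on saturation,
    and None if the iteration limit is reached."""
    stamp = count()  # increasing timestamps ("date of birth")
    passive = [(len(c), next(stamp), c) for c in map(frozenset, initial_clauses)]
    heapq.heapify(passive)
    seen = {c for _, _, c in passive}
    active = []
    for _ in range(max_iterations):
        if not passive:
            return False  # saturated without deriving a contradiction
        _, _, given = heapq.heappop(passive)  # select the given clause
        if not given:
            return True  # the empty clause
        active.append(given)
        # Perform all inferences between the given clause and the active set.
        for other in active:
            for conclusion in resolve(given, other):
                if conclusion not in seen:
                    seen.add(conclusion)
                    heapq.heappush(passive, (len(conclusion), next(stamp), conclusion))
    return None
```

For instance, `given_clause_loop([[1], [-1, 2], [-2]])` finds a refutation, while `given_clause_loop([[1], [2]])` saturates.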

During a typical prover run, P grows much faster than A (the growth is 
roughly quadratic). Analogously, although for different reasons, when a proof is 
discovered, its clauses constitute only a fraction of A. Notice that every clause 
$C \in A$ that is in the end not part of the proof did not need to be selected and 
represents a wasted effort. This explains why clause selection, i.e. the procedure 
for picking in each iteration the next clause to process, is one of the main heuristic 
decision points in the prover, which hugely affects its performance [32]. 


2.1 Traditional Approaches to Clause Selection 


There are two basic criteria that have been identified as generally correlating 
with the likelihood of a clause contributing to the yet-to-be discovered proof. 

One is a clause’s age or, more precisely, its “date of birth”, typically 
implemented as an ever increasing timestamp. Preferring for selection old clauses to 
more recently derived ones corresponds to a breadth-first strategy and ensures 
fairness. The other criterion is a clause’s size, referred to as weight in the ATP 
lingo, which is realized by some form of symbol counting. Preferring for selection 
small clauses to large ones is a greedy strategy, based on the observation that 
small conclusions typically belong to inferences with small premises and that the 
ultimate conclusion—the empty clause—is the smallest of all. The best results 
are achieved when these two criteria (or their variations) are combined [32]. 

To implement efficient clause selection by numerical criteria such as age and 
weight, an ATP represents the passive set P as a set of priority queues. A 
queue contains (pointers to) the clauses in P ordered by its respective criterion. 
Selection typically alternates between the available queues under a certain ratio. 
A successful strategy is, for instance, to select 10 clauses by weight for every 
clause selected by age, i.e., with an age-to-weight ratio of 1:10. 
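
A minimal sketch of such a two-queue passive set follows. The class name `AgeWeightPassive`, its parameters, and the lazy-deletion bookkeeping are assumptions made for the illustration and do not describe how any particular prover implements its queues.

```python
import heapq
from itertools import count

class AgeWeightPassive:
    """Passive set kept in two priority queues, alternated under a ratio."""

    def __init__(self, age_picks=1, weight_picks=10):
        self._age_queue = []     # entries (age, clause)
        self._weight_queue = []  # entries (weight, age, clause)
        self._stamp = count()    # the timestamp doubles as a unique clause id
        self._removed = set()    # ids of clauses already selected
        self._schedule = ["age"] * age_picks + ["weight"] * weight_picks
        self._position = 0

    def add(self, clause, weight):
        age = next(self._stamp)
        heapq.heappush(self._age_queue, (age, clause))
        heapq.heappush(self._weight_queue, (weight, age, clause))

    def select(self):
        """Pop the next clause according to the ratio, or None if empty."""
        kind = self._schedule[self._position % len(self._schedule)]
        queue = self._age_queue if kind == "age" else self._weight_queue
        while queue:
            entry = heapq.heappop(queue)
            age, clause = entry[-2], entry[-1]
            if age not in self._removed:  # skip clauses taken via the other queue
                self._removed.add(age)
                self._position += 1
                return clause
        return None  # both queues hold the same live clauses, so P is empty
```

With the default arguments the schedule realizes the age-to-weight ratio 1:10 mentioned above.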


2.2 ENIGMA-style Machine-Learned Clause Selection Guidance 


The idea to improve clause selection by learning from previous prover experience 
goes, to the best of our knowledge, back to Schulz [8,30] and has more recently 
been successfully employed by the ENIGMA system and others [7, 15-17, 22]. 

The experience is collected from successful prover runs, where each selected 
clause constitutes a training example and the example is marked as positive, if 
the clause ended-up in the discovered proof, and negative otherwise. A machine 
learning (ML) algorithm is then used to fit this data and produce a model M 
for classifying clauses into positive and negative, accordingly. A good learning 
algorithm produces a model M which not only accurately classifies the training 
data but also generalizes well to unseen examples. The computational costs of 
both training and evaluation are also important. 

While clauses are logical formulas, i.e., discrete objects forming a countable 
set, ML algorithms, rooted in mathematical statistics, are primarily equipped 
to deal with fixed-size real-valued vectors. Thus the question of how to 
represent clauses for the learning is the first obstacle that needs to be overcome, 
before the whole idea can be made to work. In the beginning, the authors of 
ENIGMA experimented with various forms of hand-crafted numerical clause 
features [16,17]. An attractive alternative explored in later work [7,15,22] is the 
use of artificial neural networks, which can be understood as extracting the most 
relevant features automatically. 

An important distinction can in both cases be made between approaches 
which have access to the concrete identity of predicate and function symbols (i.e., 
the signature) that make up the clauses, and those that do not. For example: 
Is the ML algorithm allowed to assume that the symbol grp_mult is used to 
represent the multiplication operation in a group or does it only recognize a 
general binary function? The first option can be much more powerful, but we 
need to ensure that the signature symbols are aligned and used consistently 
across the problems in our benchmark. Otherwise the learned advice cannot 
meaningfully carry over to previously unsolved problems. While the assumption 
of aligned signature has been employed by the early systems [16, 22], the most 
recent version of ENIGMA [15,24] can work in a “signature agnostic” mode. 

In this work we represent clauses solely by their derivation history, deliber- 
ately ignoring their logical content. Thus we do not require the assumption of 
an aligned signature, per se. However, we rely on a fixed set of distinguished 
axioms to supply features in the derivation leaves. 


2.3 Integrating the Learned Advice 


Once we have a trained model M, an immediate possibility for integrating it 
into the clause selection procedure is to introduce a new queue that will order 
the clauses using M. Two basic versions of this idea have been described: 


Improving ENIGMA-style Clause Selection while Learning From History 547 


“Priority”: The ordering puts all the clauses classified by M as positive before 
those classified negatively. Within the two classes, older clauses are preferred. 


Let us for the purposes of future reference denote this scheme $M^{-1,0}$. It has 
been successfully used by the early ENIGMAs [7, 16, 17]. 


“Logits”: Even models officially described as binary classifiers typically inter- 
nally compute a real-valued estimate L of how much “positive” or “negative” an 
example appears to be and only turn this estimate into a binary decision in the 
last step, by comparing it against a fixed threshold t, most often 0. A machine 
learning term for this estimate L is the logit. 

The second version orders the clauses on the new queue by the “raw” logits 
produced by a model. We denote it $M^{-\mathbb{R}}$ to stress that clauses with high L are 
treated as small from the perspective of the selection and therefore preferred. 
This scheme has been used by Loos et al. [22] and in the latest ENIGMA [15,37]. 
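
The two orderings can be expressed as sort keys for a priority queue, where smaller keys are selected first. In this sketch the function names are hypothetical, and breaking ties by age in the logit ordering is an assumption that the text does not specify.

```python
def priority_key(logit, age, threshold=0.0):
    """'Priority' ordering: positively classified clauses first, older first within a class."""
    is_negative = 0 if logit > threshold else 1
    return (is_negative, age)

def logit_key(logit, age):
    """'Logits' ordering: a higher logit means a smaller key (assumed tie-break: age)."""
    return (-logit, age)

# Each clause is (name, age, logit); the values are made up for the example.
clauses = [("c1", 0, -2.0), ("c2", 1, 0.5), ("c3", 2, 3.0), ("c4", 3, -0.1)]
by_priority = [c[0] for c in sorted(clauses, key=lambda c: priority_key(c[2], c[1]))]
by_logit = [c[0] for c in sorted(clauses, key=lambda c: logit_key(c[2], c[1]))]
```

Under the first ordering the clause c2 precedes c3 because it is older, while the second ordering prefers c3 for its higher logit.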


Combining with a traditional strategy. While it is possible to rely exclusively 
on selection governed by the model, it turns out to be better [7] to combine it 
with the traditional heuristics. The most natural choice is to take S, the original 
strategy that was used to generate the training data, and extend it by adding 
the new queue, be it $M^{-1,0}$ or $M^{-\mathbb{R}}$, next to the already present queues. We 
then again supply a ratio under which the original selection from S and the new 
selection based on M get alternated. We will denote this kind of combination 
with the original strategy as $S \oplus M^{-1,0}$ and $S \oplus M^{-\mathbb{R}}$, respectively. 


3 Layered Clause Selection and Lazy Model Evaluation 


Layered clause selection (LCS) is a recently developed method [10, 11,36] for 
smoothly incorporating a categorical preference for certain clauses into a base 
clause selection strategy S. In this paper, we will readily use it in combination 
with the binary classifier advice from a trained model M. 

When we instantiate LCS to our particular case,² its function can be 
summarized by the expression 


$S \oplus S[M^1]$. 


In words, the base selection strategy S is alternated with $S[M^1]$, the same 
selection scheme S but applied only to clauses classified positively by M. Implicit 
here is a convention that whenever there is no positively classified passive clause, 
a fallback to plain S occurs. Additionally, we again specify a “second-level” ratio 
to govern the alternation between pure S and $S[M^1]$. 

The main advantage of LCS, compared to the options outlined in the previous 
section, is that the original, typically well-tuned, base selection mechanism S is 
also applied to $M^1$, the clauses classified positively by M. 


1 A logit can be turned into a (formal) probability, i.e. a value between 0 and 1, by 
passing it, as is typically done, through the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$. 
2 We rely here on the monotone mode of split; there is also a disjoint mode [10]. 

3.1 Lazy Model Evaluation 


It is often the case that evaluating a clause by the model M is a relatively 
expensive operation [22]. As we explain here, however, this operation can be 
avoided in many cases, especially when using LCS to integrate the advice. 

We propose the following lazy evaluation approach to be used with $S \oplus S[M^1]$. 
Every clause entering the passive set P is initially inserted to both S and $S[M^1]$ 
without being evaluated by M. Then, whenever (as governed by the second-level 
ratio) it is the moment to select a clause from $S[M^1]$, the algorithm 


1. picks (as usual, according to S) the best clause C in $S[M^1]$, 
2. only then evaluates C by M, and 
3. if C gets classified as negative, it forgets C, and goes back to 1. 


This repeats until the first positively classified clause is found, which is then 
returned. Note that this way the “observable behaviour” of $S[M^1]$ is preserved. 

The power of lazy evaluation lies in the fact that not every clause needs to be 
evaluated before a proof is found. Indeed, recall the remark that the passive set 
P is typically much larger than the active set A, which also holds on a typical 
successful termination. Every clause left in passive at that moment is a clause 
that did not need to be evaluated by M thanks to lazy evaluation. 
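
The three steps can be sketched as follows. The class `LazyPositiveQueue` and its interface are hypothetical; the sketch covers only the queue of positively classified clauses, leaves the fallback to plain S to the caller (signalled by `None`), and omits the removal of a selected clause from the queues of S.

```python
import heapq
from itertools import count

class LazyPositiveQueue:
    """Queue for positively classified clauses, with lazy model evaluation."""

    def __init__(self, classify):
        self._classify = classify  # the (potentially slow) model: clause -> bool
        self._heap = []
        self._tie = count()        # keeps heap entries comparable on equal keys
        self.evaluations = 0       # number of model calls made so far

    def add(self, clause, key):
        # Insert without evaluating; `key` is the clause's rank under S.
        heapq.heappush(self._heap, (key, next(self._tie), clause))

    def select(self):
        while self._heap:
            _, _, clause = heapq.heappop(self._heap)  # 1. best clause according to S
            self.evaluations += 1
            if self._classify(clause):                # 2. only now evaluate it
                return clause
            # 3. classified as negative: forget the clause and try the next one
        return None  # no positive clause left; the caller falls back to plain S

    def __len__(self):
        return len(self._heap)
```

Clauses still sitting in the heap when a proof is found were never passed to `classify`, which is where the savings come from.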

We remark that lazy evaluation can similarly be used with the integration 
mode $M^{-1,0}$ based on priorities. 

We experimentally demonstrate the effect of the technique in Sect. 5.4. 


4 A Neural Classification of Clause Derivations 


In this work we choose to represent a clause, for the purpose of learning, solely by 
its derivation history. Thus a clause can only be distinguished by the axioms from 
which it was derived and by the precise way in which these axioms interacted 
with each other through inferences in the derivation. This means we deliberately 
ignore the clause’s logical content. 

We decided to focus on this representation, because it promises to be fast. 
Although an individual clause’s derivation history may be large, it is a sim- 
ple function of its parents’ histories (just one application of an inference rule). 
Moreover, before a clause with a complicated history can be selected, most of its 
ancestors will have been selected already.³ This guarantees the amortised cost 
of evaluating a single clause to be constant. 

A second motivation comes from our recent work [11], where we have shown 
that theory reasoning facilitated by automatically adding theory axioms for ax- 
iomatising theories, while in itself a powerful technique, often leads the prover 
to unpromising parts of the search space. We developed a heuristic for control- 
ling the amount of theory reasoning in the derivation of a clause [11]. Our goal 
here is to test whether a similar or even stronger heuristic can be automatically 
discovered by a neural network. 


3 Exceptions are caused by simplifying inferences applied eagerly outside of the gov- 
ernance of the main clause selection mechanism. 

Examples of axioms that Vampire uses to axiomatise theories include the 
commutativity or associativity axioms for the arithmetic operations, an axiom- 
atization of the theory of arrays [6] or of the theory of term algebras [20]. For us 
it is mainly important that the axioms are introduced internally by the prover 
and can therefore be consistently identified across individual problems. 


4.1 Recursive Neural Networks 


A recursive neural network (RvNN) is a network created by recursively compos- 
ing a finite set of neural building blocks over a structured input [12]. A general 
neural block is a function $N_\theta : \mathbb{R}^k \to \mathbb{R}^l$ depending on a vector of parameters $\theta$ 
that can be optimized during training (see below in Section 4.3). 

In our case, the structured input is a clause derivation, i.e. a DAG with nodes 
identified with the derived clauses. To enable a recursion, an RvNN represents 
each node C by a real vector $v_C$ (of a fixed dimension n) called a (learnable) 
embedding. During training a network learns to embed the space of derivable 
clauses into $\mathbb{R}^n$ in some a priori unknown, but still useful way. 

We assume that each initial clause C, a leaf of the derivation DAG, is labeled 
as belonging to one of the automatically added theory axioms or coming from 
the user input. Let these labels form a finite set of axiom origin labels $\mathcal{L}_A$. 
Furthermore, let the applicable inference rules that label the internal nodes of 
the DAG form a finite set of inference rule labels $\mathcal{L}_R$. The specific building blocks 
of our neural architecture are the following three (indexed families of) functions: 


— for every axiom label $l \in \mathcal{L}_A$, a nullary init function $I_l \in \mathbb{R}^n$ which to an 
initial clause C labeled by l assigns its embedding $v_C := I_l$, 

— for every inference rule $r \in \mathcal{L}_R$, a deriv function $D_r : \mathbb{R}^n \times \cdots \times \mathbb{R}^n \to \mathbb{R}^n$ 
which to a conclusion clause $C_c$ derived by r from premises $(C_1, \ldots, C_k)$ with 
embeddings $v_{C_1}, \ldots, v_{C_k}$ assigns the embedding $v_{C_c} := D_r(v_{C_1}, \ldots, v_{C_k})$, 

— and, finally, a single eval function $E : \mathbb{R}^n \to \mathbb{R}$ which evaluates an embedding 
$v_C$ such that the corresponding clause C is classified as positive whenever 
$E(v_C) > t$, with the threshold t set, by default, to 0. 


By recursively composing the init and deriv functions, any derived clause C 
can be assigned an embedding $v_C$ and also evaluated by E to see whether the 
network recommends it as positive, i.e., as a clause that should be preferred in proof search. 
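
The recursive composition can be illustrated with a small sketch. The encoding of a derivation as nested tuples, the label names, and the toy init, deriv, and eval functions (with n = 2) are all assumptions chosen to keep the example readable; they are not trained or taken from the experiments.

```python
def embed(node, init, deriv):
    """Embedding of a derivation node.

    A node is either an axiom origin label (a string) or a tuple
    (rule_label, premise_1, ..., premise_k) of an applied inference.
    """
    if isinstance(node, str):
        return init[node]  # nullary init function: a fixed vector
    rule, *premises = node
    premise_embeddings = [embed(p, init, deriv) for p in premises]
    return deriv[rule](*premise_embeddings)

def classify(node, init, deriv, evaluate, threshold=0.0):
    """True if the clause with derivation `node` is classified as positive."""
    return evaluate(embed(node, init, deriv)) > threshold

# Toy building blocks, purely illustrative.
INIT = {"input": (1.0, 0.0), "thax": (0.0, -1.0)}
DERIV = {
    "Factoring": lambda v: (0.5 * v[0], 0.5 * v[1]),
    "Resolution": lambda a, b: (a[0] + b[0], a[1] + b[1]),
}
EVAL = sum

derivation = ("Resolution", "thax", ("Factoring", "input"))
embedding = embed(derivation, INIT, DERIV)
```

Here `embedding` is `(0.5, -1.0)`, the toy eval function yields -0.5, and the clause is therefore classified as negative under the default threshold.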


4.2 Architecture Details 


Here we outline the details of our architecture for the benefit of neural network 
practitioners. All the used terminology is standard (see, e.g., [13]). 

We realized each init function $I_l$ as an independent learnable vector. Similarly, 
each deriv function $D_r$ was independently defined. For a rule of arity two, such 
as resolution, we used: 


$D_r(v_1, v_2) = \mathrm{LayerNorm}(y), \quad y = W_2^r \cdot x + b_2^r, \quad x = \mathrm{ReLU}(W_1^r \cdot [v_1, v_2] + b_1^r),$ 


where $[\cdot,\cdot]$ denotes vector concatenation, ReLU is the rectified linear unit 
non-linearity ($f(x) = \max\{0, x\}$) applied component-wise, and the learnable matrices 
$W_1^r, W_2^r$ and vectors $b_1^r, b_2^r$ are such that $x \in \mathbb{R}^{2n}$ and $y \in \mathbb{R}^n$. (We took 
inspiration from Sandler et al. [29] for doubling the embedding size before applying the 
non-linearity.) Finally, LayerNorm is a layer normalization [2] module, without 
which training often became numerically unstable for deeper derivation DAGs.⁴ 

For unary inference rules, such as factoring, we used an equation analogous 
to the above, except for the concatenation operation. We did not need to model 
an inference rule with a variable number of premises, but one option would be 
to arbitrarily “bracket” its arguments into a tree of binary applications. 

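Finally, the eval function was $E(v) = W_2 \cdot \mathrm{ReLU}(W_1 \cdot v + b) + c$ with trainable 
$W_1 \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$, $W_2 \in \mathbb{R}^{1 \times n}$, and $c \in \mathbb{R}$. 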
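
To make the shapes concrete, the following dependency-free sketch spells out a binary deriv block and the eval function on plain Python lists. It is not the PyTorch implementation used in the experiments: the random initialization is arbitrary, the layer normalization omits the learnable gain and bias that library modules usually include, and all function names are illustrative.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(x):
    return [max(0.0, xi) for xi in x]

def layer_norm(y, eps=1e-5):
    # Normalize to zero mean and (approximately) unit variance.
    mean = sum(y) / len(y)
    var = sum((yi - mean) ** 2 for yi in y) / len(y)
    return [(yi - mean) / math.sqrt(var + eps) for yi in y]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def make_binary_deriv(n, rng):
    """Deriv block for a binary rule: R^n x R^n -> R^n via a hidden R^{2n}."""
    W1, b1 = random_matrix(2 * n, 2 * n, rng), [0.0] * (2 * n)
    W2, b2 = random_matrix(n, 2 * n, rng), [0.0] * n
    def deriv(v1, v2):
        x = relu(add(matvec(W1, v1 + v2), b1))  # v1 + v2 concatenates the lists
        y = add(matvec(W2, x), b2)
        return layer_norm(y)
    return deriv

def make_eval(n, rng):
    """Eval function: R^n -> R, returning the logit."""
    W1, b = random_matrix(n, n, rng), [0.0] * n
    W2, c = random_matrix(1, n, rng), 0.0
    def evaluate(v):
        hidden = relu(add(matvec(W1, v), b))
        return matvec(W2, hidden)[0] + c
    return evaluate
```

A unary block would look the same except that its first matrix takes a single embedding of size n as input.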


4.3 Training the Network 


To train a network means to find values for the trainable parameters such that 
it accurately classifies the training data and ideally also generalises to unseen 
future cases. We follow a standard methodology for training our RvNN. 

In particular, we use the gradient descent (GD) optimization algorithm (with 
the Adam optimiser [19]) minimising the typical binary cross-entropy loss, 
composed as a sum of contributions, for every selected clause C, of the form 


$-y_C \cdot \log(\sigma(E(v_C))) - (1 - y_C) \cdot \log(1 - \sigma(E(v_C)))$, 


with $y_C = 1$ for the positive and $y_C = 0$ for the negative examples. 

These contributions are weighted such that each derivation DAG (corre- 
sponding to a prover run on a single problem) receives equal weight. Moreover, 
within each DAG we re-scale the influence of positive versus the negative exam- 
ples such that these two categories contribute evenly. The scaling is important 
as our training data is highly unbalanced (cf. Sect. 5.1). 
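
One possible reading of this weighting is sketched below. The exact scheme is not spelled out in the text, so the choice made here (average the contributions within each class, average the two classes, then average over DAGs) is an assumption, as are the function names. The per-example loss is written in the numerically stable form softplus(z) - y·z, which equals the cross-entropy formula above for a logit z = E(v_C).

```python
import math

def softplus(x):
    # log(1 + exp(x)) computed without overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bce_with_logit(logit, label):
    """Binary cross-entropy of a single example, label in {0, 1}."""
    return softplus(logit) - label * logit

def balanced_dag_loss(examples):
    """Loss of one derivation DAG; `examples` is a list of (logit, label).

    Positive and negative examples contribute evenly regardless of
    how many examples each class has.
    """
    positives = [bce_with_logit(z, 1) for z, y in examples if y == 1]
    negatives = [bce_with_logit(z, 0) for z, y in examples if y == 0]
    parts = [sum(group) / len(group) for group in (positives, negatives) if group]
    return sum(parts) / len(parts)

def total_loss(dags):
    """Every DAG receives equal weight, irrespective of its size."""
    return sum(balanced_dag_loss(dag) for dag in dags) / len(dags)
```

With one positive and three confidently rejected negatives in a DAG, the positive example accounts for half of that DAG's loss rather than a quarter.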

We split the available successful derivations into a training set and a valida- 
tion set, and only train on the first set using the second to observe generalisation 
to unseen examples. As the GD algorithm progresses, iterating over the training 
data in rounds called epochs, we evaluate the loss on the validation set and stop 
the process early if this loss does not decrease for a specified period. This early 
stopping criterion was important to produce a model that generalizes well. 
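
The early stopping criterion can be sketched independently of the model. In the sketch below, `train_epoch` and `validation_loss` are hypothetical callbacks standing in for one pass over the training set and for the evaluation on the validation set, and `patience` is the assumed name for the "specified period".

```python
def train_with_early_stopping(train_epoch, validation_loss, max_epochs, patience):
    """Train until the validation loss has not improved for `patience` epochs.

    Returns the epoch with the lowest validation loss and that loss.
    """
    best_loss, best_epoch = float("inf"), None
    for epoch in range(1, max_epochs + 1):
        train_epoch(epoch)
        loss = validation_loss(epoch)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss
```

In practice the parameters of the best epoch are stored so that the returned epoch number can be turned back into a model.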

As another form of regularisation, i.e. a technique for preventing overfitting 
to the training data, we employ dropout [35] (independently for each “read” of 
a clause embedding by one of the deriv or eval functions). Dropout means that 
at training time each component v; of the embedding v has a certain probability 
of being zero-ed out. This “voluntary brain damage” makes the network more 
robust as it prevents neurons from forming too complex co-adaptations [35]. 

Finally, we experimented with using non-constant learning rates as suggested 
by Smith et al. [33,34]. In the end, we used a schedule with a linear warmup for 
the first 50 epochs followed by a hyperbolic cooldown [38] (cf. Fig. 1 in Sect. 5.2). 
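
As a rough illustration of such a schedule, the sketch below combines a linear warmup with a cooldown proportional to 1/epoch. The precise form of the hyperbolic cooldown is not given in the text, so the formula for the second branch is an assumption; the default peak value is the one reported in Sect. 5.2.

```python
def learning_rate(epoch, peak=2.5e-4, warmup_epochs=50):
    """Learning rate for a 1-based epoch number.

    Linear warmup up to `peak` at `warmup_epochs`, followed by an
    assumed hyperbolic cooldown peak * warmup_epochs / epoch.
    """
    if epoch <= warmup_epochs:
        return peak * (epoch / warmup_epochs)
    return peak * (warmup_epochs / epoch)
```

Under these assumptions the rate at epoch 100 is back at half of its peak value.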


4 We also tried to skip LayerNorm and replace ReLU by the hyperbolic tangent 
function. This restores stability, but does not train or classify so well. 

4.4 An Abstraction for Compression and Caching 


Since our representation of clauses deliberately discards information, we end 
up encountering distinct clauses indistinguishable from the perspective of the 
network. For example, every initial clause C originating from the input problem 
(as opposed to being added as a theory axiom) receives the same embedding 
$v_C = I_{\text{input}}$. Indistinguishable clauses also arise as conclusions of an inference 
that can be applied in more than one way to certain premises. 

Mathematically, we deal with an equivalence relation $\sim$ on clauses based on 
“having the same derivation tree”: $C_1 \sim C_2 \Leftrightarrow \mathrm{derivation}(C_1) = \mathrm{derivation}(C_2)$. 
The “fingerprint” derivation(C) of a clause could be defined as a formal 
expression recording the derivation history of C using the labels from $\mathcal{L}_A$ as nullary 
operators and those from $\mathcal{L}_R$ as operators with arities of the corresponding 
inference rules. For example: Resolution(thax_inverse_assoc, Factoring(input)). 

We made use of this equivalence in our implementation in two places: 


1. When preparing the training data. We “compressed” each derivation DAG 
as a factorisation by ~, keeping only one representative of each class. A class 
containing a positive example was marked as a positive example. 

2. When interfacing the trained model from the ATP. We cached the embed- 
dings (and evaluated logits) for the already encountered clauses under their 
class identifier. Sect. 5.4 evaluates the effect of this technique. 
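
The caching in the second item can be sketched as follows. The nested-tuple encoding of derivations and the names `fingerprint` and `CachedModel` are assumptions for the illustration; a real implementation would more likely build the class identifier incrementally from the identifiers of the premises than re-walk the whole derivation.

```python
def fingerprint(derivation):
    """Formal expression identifying the equivalence class of a clause.

    A derivation is an axiom origin label (a string) or a tuple
    (rule_label, premise_1, ..., premise_k).
    """
    if isinstance(derivation, str):
        return derivation
    rule, *premises = derivation
    return f"{rule}({', '.join(fingerprint(p) for p in premises)})"

class CachedModel:
    """Wraps a slow evaluation function with a cache keyed by fingerprint."""

    def __init__(self, evaluate):
        self._evaluate = evaluate
        self._cache = {}
        self.misses = 0  # number of real model evaluations

    def logit(self, derivation):
        key = fingerprint(derivation)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._evaluate(derivation)
        return self._cache[key]
```

Two distinct clauses with the same abstract derivation history then share a single model evaluation.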


5 Experiments 


We implemented the infrastructure for training an RvNN clause derivation clas- 
sifier (as described in Sect. 4) in Python, relying on the PyTorch (version 1.7) 
library [25] and its TorchScript extension for interfacing the trained model from 
C++. We modified the automatic theorem prover Vampire (version 4.5.1) to (1) 
optionally record to a log file the constructed derivation, including information 
on selected clauses and clauses found in the discovered proof (the logging-mode), 
(2) to be able to load a trained TorchScript model and use it for clause selection 
guidance under various modes of integration (detailed in Sects. 2.3 and 3).⁵ 

We took the same subset of 20795 problems from the SMT-LIB library [5] 
as in previous work [11]: formed as the largest set of problems in a fragment 
supported by Vampire, excluding problems known to be satisfiable and those 
provable by Vampire’s default strategy in 10s either without adding theory ax- 
ioms or while performing clause selection by age only. 

As the baseline strategy S we took Vampire’s implementation of the DIS- 
COUNT saturation loop under the age-to-weight ratio 1:10 (which typically 
performs well with DISCOUNT), keeping all other settings default, including 
the enabled AVATAR architecture. We later enhanced this S with various forms 
of guidance. All the benchmarking was done using a 10s time limit.⁶ 


5 Supplementary materials can be found at https://git.io/JtHNl. 
6 Running on an Intel(R) Xeon(R) Gold 6140 CPUs @ 2.3 GHz server with 500 GB 
RAM, using no more than 30 of the available 72 cores to reduce mutual influence. 

5.1 Data Preparation 


During an initial run, the baseline strategy S was able to solve 734 problems 
under the 10s time limit. We collected the corresponding successful derivations 
using the logging-mode (and lifting the time limit, since the logging causes a 
non-negligible overhead) and processed them into a form suitable for training 
a neural model. The derivations contained approximately 5.0 million clauses in 
total (the overall context), out of which 3.9 million were selected⁷ (the training 
examples) and 30 thousand of these appeared in a proof (the positive examples). 
In these derivations, Vampire used 31 distinct theory axioms to facilitate theory 
reasoning. Including the “user input” label for clauses coming from the actual 
problem files, there were in total 32 distinct labels for the derivation leaves. 
In addition, we recorded 15 inference rules, such as resolution, superposition, 
backward and forward demodulation or subsumption resolution and including 
one rule for the derivation of a component clause in AVATAR [26,39]. Thus we 
obtained 15 distinct labels for the internal nodes. 

We compressed these derivations identifying clauses with the same “abstract 
derivation history” dictated by the labels, as described in Sect. 4.4. This reduced 
the derivation set to 0.7 million nodes (i.e. abstracted clauses) in total. Out of 
the 734 derivations 242 were still larger than 1000 nodes (the largest had 6426 
nodes) and each of these gave rise to a separate “mini-batch”. We grouped the 
remaining 492 derivations to obtain an approximate size of 1000 nodes per mini- 
batch (the maximum was 12 original derivations grouped in one mini-batch). In 
total, we obtained 412 mini-batches and randomly singled out 330 (i.e., 80%) of 
these for training, keeping 82 aside for validation. 
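The grouping and the split described above can be sketched as follows. This is a minimal illustration under our own assumptions: the greedy packing order, the function names, and the fixed random seed are hypothetical and not taken from the actual training scripts.

```python
import random

def group_into_minibatches(sizes, target=1000):
    """Return lists of derivation indices; `sizes` holds node counts."""
    # Every derivation larger than the target forms its own mini-batch.
    batches = [[i] for i, size in enumerate(sizes) if size > target]
    current, total = [], 0
    for i, size in enumerate(sizes):
        if size > target:
            continue
        # Close the current batch once the next derivation would overshoot.
        if current and total + size > target:
            batches.append(current)
            current, total = [], 0
        current.append(i)
        total += size
    if current:
        batches.append(current)
    return batches

def split_batches(batches, train_fraction=0.8, seed=0):
    """Randomly split mini-batches into a training and a validation part."""
    shuffled = list(batches)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

With 412 mini-batches and a training fraction of 0.8, the split yields 330 training and 82 validation mini-batches, matching the numbers reported above.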


5.2 Training 


Since the size of the training set is relatively small, we instantiated the architec- 
ture described in Sect. 4.2 with embedding size n = 64 and dropout probability 
p = 0.3. We trained for 100 epochs, with a non-constant learning rate peaking at 
α = 2.5 × 10⁻⁴ in epoch 50. Every epoch we computed the loss on the validation
set and selected the model which minimizes this quantity. This was the model 
from epoch 45 in our case, which we will denote M here. 

The development of the training and validation loss throughout training, as 
well as that of the learning rate, is plotted in Fig. 1. Additionally, the right 
side of the figure allows us to compare the validation loss—an ML estimate of 
the model’s ability to generalize—with the ultimate metric of practical gener- 
alization, namely the number of in-training-unseen problems solved by Vam- 
pire equipped with the corresponding model for guidance. We can see that the 
“proxy” (i.e. the minimisation of the validation loss) and the “target” (i.e. the 
maximisation of ATP performance) correspond quite well, at least to the degree 
that we measured the highest ATP gain with the validation-loss-minimizing M. 


7 Ancestors of selected clauses are sometimes not selected clauses themselves if they
arise through immediate simplifications or through reductions. 
8 Integrated using the layered scheme with a second level ratio 2:1 (cf. Sect. 5.3). 

Fig. 1. Training the neural model. Red: the training (left) and validation (right) loss
as a function of training time; shaded: per problem weighted standard deviations. Blue
(left): the supplied non-constant learning rate (cf. Sect. 4.3). Green (right):
in-training-unseen problems solved by Vampire equipped with the corresponding model.


We remark that this assurance was not cheap to obtain. While the whole 
100 epoch training took 45 minutes to complete (using 20 workers and 1 master 
process in a parallel training setup), each of the 20 ATP evaluation data points 
corresponds to approximately 2 hours of 30 core computation. 


5.3 Advice Integration 


In this part of the experiment we tested the various ways of integrating the learnt 
advice as described in Sects. 2.3 and 3. Let us recall that these are the single
queue schemes M^{−ℝ} and M^{1,0} based on the raw logits and the binary decision,
respectively, their combinations S ⊙ M^{−ℝ} and S ⊙ M^{1,0} with the base strategy
S under some second level ratio, and, finally, S ⊕ S[M^{1,0}], the integration of the
guidance by the layered clause selection scheme.
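To make the role of a selection ratio concrete, the following sketch alternates between two priority queues under a fixed ratio. It is a generic illustration only: the schemes themselves are defined in Sects. 2.3 and 3, and the function name, the round-robin schedule, and the tuple layout are our assumptions rather than Vampire's implementation.

```python
import heapq
from itertools import cycle

def select_under_ratio(queue_a, queue_b, ratio_a, ratio_b, steps):
    """Select up to `steps` clauses, taking ratio_a picks from queue_a for
    every ratio_b picks from queue_b (both ratios assumed positive).

    Each queue is a list of (priority, clause) pairs and is consumed in place.
    A clause already selected via one queue is skipped in the other.
    """
    heapq.heapify(queue_a)
    heapq.heapify(queue_b)
    schedule = cycle([queue_a] * ratio_a + [queue_b] * ratio_b)
    selected, seen = [], set()
    while len(selected) < steps and (queue_a or queue_b):
        queue = next(schedule)
        while queue:
            _, clause = heapq.heappop(queue)
            if clause not in seen:
                seen.add(clause)
                selected.append(clause)
                break
    return selected
```

For instance, with an age queue and a weight queue under the ratio 1:2, one clause is taken by age, then two by weight, and so on.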

Our results are shown in Table 1. It starts by reporting on the performance of 
the baseline strategy S and then compares it to the other strategies (the gained 
and lost columns are w.r.t. the original run of S).⁹ We can see that the two single
queue approaches are quite weak, with the better M^{1,0} solving only 25% of the
baseline. Nor can the combination S ⊙ M^{−ℝ} be considered a success, as it only
solves more problems when less and less advice is taken, seemingly approaching
the performance of S from below. This trend repeats with S ⊙ M^{1,0}, although
here an interesting number of problems not solved by the baseline is gained by
strategies which rely on the advice more than half of the time.

With our model M, only the layered clause selection integration S ⊕ S[M^{1,0}]
is able to improve on the performance of the baseline strategy S. In fact, it 


9 We had to switch to a different machine after producing the training data. There,
a rerun of S gave a slightly better performance than the 734 solved problems used 
for training. We still used the original run’s results to compute the gained and lost 
values here; the percentage solved is with respect to the new run of S. 

Table 1. Performance results of various forms of integrating the model advice.

strategy           ratio   M eval. time%   #solved   (percent S)   gained   lost
S                  –             0%           756       100%          26       4
M^{−ℝ}             –            25%            55         7%          25     704
M^{1,0}            –            13%           190        25%          30     574
S ⊙ M^{−ℝ}         5:1          57%           543        71%          86     277
                   2:1          48%           445        58%          78     367
                   1:1          41%           335        44%          54     453
                   1:2          32%           248        32%          39     525
                   1:5          32%           140        18%          28     622
S ⊙ M^{1,0}        10:1         11%           686        90%          80     128
                   2:1          14%           602        79%         112     244
                   1:1          14%           555        73%         111     290
                   1:2          14%           519        68%         132     347
                   1:10         14%           520        68%         132     346
S ⊕ S[M^{1,0}]     2:1          27%           855       113%         210      89
                   1:1          32%          1032       136%         411     113
                   1:2          33%          1036       137%         430     128
                   1:3          30%          1026       135%         428     136
                   1:5          25%           989       130%         405     150



Table 2. Performance decrease caused by turning off abstraction caching and lazy
evaluation, and both; demonstrated on S ⊕ S[M^{1,0}] under the second level ratio 1:2.

                               M eval. time%   #solved   (percent S)
both techniques enabled             33%          1036       137%
without abstraction caching         45%          1007       133%
without lazy evaluation             58%           905       119%
both techniques disabled            73%           782       103%


improves on it very significantly: with the second level ratio of 1:2 we achieve 
137% performance of the baseline and gain 430 problems unsolved by S. 
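The columns of Table 1 are related by simple arithmetic, which can be checked as below. The sketch assumes, following the footnote on the machine switch, that gained and lost are counted against the original run of S (734 solved), that the percentage is taken against the rerun (756 solved), and that percentages are truncated; the constant and function names are ours.

```python
ORIGINAL_RUN_SOLVED = 734   # run of S used to produce the training data
RERUN_SOLVED = 756          # rerun of S on the evaluation machine

def solved_from_gain_loss(gained, lost):
    """Problems solved = original baseline + newly gained - lost."""
    return ORIGINAL_RUN_SOLVED + gained - lost

def percent_of_baseline(solved):
    """Percentage relative to the rerun of S, truncated to an integer."""
    return int(100 * solved / RERUN_SOLVED)
```

For the best row (ratio 1:2), 734 + 430 − 128 = 1036 solved problems, which is 137% of 756.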


5.4 Evaluation Speed, Lazy Evaluation, and Abstraction Caching 


Table 1 also shows the percentage of computation time the individual strategies 
spent evaluating the advice, i.e. interfacing M. 

A word of warning first. These numbers are hard to interpret across different
strategies, because different guidance steers the prover to different parts
of the search space. For example, notice the seemingly paradoxical situation
most pronounced with S ⊙ M^{−ℝ}, where the more often the advice from M is
nominally requested, the less time the prover spends interfacing M. Looking
closely at a few problems, we discovered that in strategies relying a lot on M^{−ℝ},
such as S ⊙ M^{−ℝ} under the ratio 1:5, most of the time is spent performing
forward subsumption. An explanation is that the guidance becomes increasingly
bad and the prover slows down, processing larger and larger clauses for which
the subsumption checks are expensive and dominate the runtime.¹⁰


10 A similar experience with bad guidance has been made by the authors of ENIGMA. 

Fig. 2. The receiver operating characteristic curve (left) and a related plot with explicit
threshold (right) for the selected model M; both based on validation data.


When the guidance is the same, however, we can use the eval. time percentage
to estimate the efficiency of the integration. The results shown in Table 1
were obtained using both lazy evaluation¹¹ and abstraction caching (as described
in Sects. 3.1 and 4.4). Taking the best performing S ⊕ S[M^{1,0}] under the second
level ratio 1:2, we selectively disabled first abstraction caching, then lazy
evaluation, and finally both techniques, obtaining the values shown in Table 2.

We can see that both techniques contribute considerably to the overall
performance. Indeed, without them Vampire would spend as much as 73% of
computation time evaluating the network (compared to only 33%) and the strategy
would barely match (with 103%) the performance of the baseline S.
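The idea behind abstraction caching can be illustrated with a small memoization sketch: clauses whose derivation histories carry the same labels share one model evaluation. The representation of a derivation as a dictionary, the class name, and the stand-in model are assumptions made for illustration; the actual mechanism is the one described in Sect. 4.4.

```python
def abstraction(clause, derivation):
    """Abstract derivation history of a clause: a nested tuple of labels.

    `derivation` maps each clause to (label, parents); leaves have no parents
    and carry an axiom-origin label, internal nodes an inference-rule label.
    """
    label, parents = derivation[clause]
    return (label, tuple(abstraction(p, derivation) for p in parents))

class CachedEvaluator:
    """Evaluate a (possibly expensive) model once per distinct abstraction."""

    def __init__(self, model):
        self.model = model
        self.cache = {}
        self.calls = 0   # number of real model evaluations

    def evaluate(self, clause, derivation):
        key = abstraction(clause, derivation)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.model(key)
        return self.cache[key]
```

Two different clauses derived by the same rule from identically labelled premises then trigger only a single model call.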


5.5 Positive Bias 


Two important characteristics of an obtained model, from a machine learning
perspective, are the true positive rate (TPR) (also called sensitivity) and the
true negative rate (TNR) (also called specificity). TPR is defined as the fraction
of positively labeled examples which the model also classifies as positive. TNR
is, analogously, the fraction of negatively labeled examples which the model
classifies as negative. Our model M achieves (on the validation set) 86% TPR
and 81% TNR.

The final judgement of a neural classifier follows from a comparison to a
threshold value t, set by default to t = 0 (recall Sect. 4.1). Changing this
threshold allows us to trade TPR for TNR and vice versa in a straightforward way.
The interdependence of these two values on the varied threshold is traditionally
captured by the so-called receiver operating characteristic (ROC) curve, shown for
our model in Fig. 2 (left). Tradition dictates that the x axis be labeled by the
false positive rate (FPR) (also called fall-out), which is simply 1 − TNR. Under
such presentation, one generally strives to pick a threshold value at which the

11 With the exception of the M^{−ℝ} guidance, with which it is incompatible.

Table 3. The performance of S ⊕ S[M^{1,0}] under the second level ratio 1:2 while
changing the logit threshold. A smaller threshold means more clauses classified as
positive.

threshold   #solved   (percent S)   gained   lost
  −0.50       1063       140%         427      98
  −0.25       1066       141%         439     107
   0.00       1036       137%         430     128
   0.25        945       125%         375     164
   0.50        825       109%         278     187


curve is the closest to the upper left corner of the plot.¹² However, this is not
necessarily the best configuration for every application. 
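The trade-off between TPR and TNR under a moving threshold can be reproduced with a few lines of code. The sketch below assumes that a clause counts as positively classified when its logit is at least t (whether the comparison is strict is not stated here); the function name and the toy data in the usage note are ours.

```python
def tpr_tnr(logits, labels, threshold=0.0):
    """Return (TPR, TNR) of a logit classifier for the given threshold.

    `labels` holds True for positively labeled examples. Both classes are
    assumed to be present, so neither denominator is zero.
    """
    tp = fn = tn = fp = 0
    for logit, positive in zip(logits, labels):
        predicted_positive = logit >= threshold
        if positive and predicted_positive:
            tp += 1
        elif positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)
```

Lowering the threshold can only keep or raise TPR and keep or lower TNR; each point of the ROC curve corresponds to the pair (1 − TNR, TPR) for one threshold.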

In Fig. 2 (right), we “decompose” the ROC curve by using the threshold
t for the independent axis x. We also highlight, for every problem (again, in 
the validation set), what is the minimal logit value across all positively labeled 
examples belonging to that problem. In other words, what is the logit of the 
“least positively classified” clause from the problem’s proof. We can see that 
for the majority of the problems these minima are below the threshold t = 0. 
This means that for those problems at least one clause from the original proof 
is getting classified as negative by M under t = 0. 
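The per-problem minima highlighted in the figure can be computed as sketched below. The triple layout of the examples and the function names are hypothetical; the code only restates the definition given in the text.

```python
def min_positive_logits(examples):
    """Map each problem to the smallest logit among its positive examples.

    `examples` is an iterable of (problem, logit, is_positive) triples.
    """
    minima = {}
    for problem, logit, is_positive in examples:
        if is_positive:
            minima[problem] = min(logit, minima.get(problem, logit))
    return minima

def problems_losing_a_proof_clause(minima, threshold=0.0):
    """Problems with at least one proof clause classified as negative."""
    return sorted(p for p, lowest in minima.items() if lowest < threshold)
```

A problem is reported by the second function exactly when its "least positively classified" proof clause falls below the chosen threshold.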

These observations motivated us to experiment with non-zero values of the 
threshold in an ATP evaluation. Particularly promising seemed the use of a 
threshold t smaller than zero with the intention of classifying more clauses as 
positive. The results of the experiment are shown in Table 3. Indeed, we could
further improve the best performing strategy from Table 1 with both t = —0.25 
and t = —0.5. It can be seen that smaller values lead to fewer problems lost, but 
even the ATP gain is better with t = —0.25 than with the default t = 0, leading 
to the overall best improvement of 141% with respect to the baseline S. 


5.6 Learning from Guided Proofs and Negative Mining 


As previously unsolved problems get proven with the help of the trained
guidance, the new proofs can be used to enrich the training set and potentially
help obtain even better models. This idea of alternating the training and the ATP
evaluation steps in a reinforcing loop has been proposed and successfully real- 
ized by the authors of ENIGMA on the Mizar dataset [18]. Here we propose an 
enhancement of the idea and repeat an analogous experiment in our setting. 
By collecting proofs discovered by a selection of 8 different configurations 
tested in the previous sections, we grew our set of solved problems from 734 to 
1528. We decided to keep one proof per problem, strictly extending the origi- 
nal training set. We then repeated the same training procedure as described in 
Sect. 5.2 on this new set and on an extension of this set obtained as follows. 


Negative mining: We suspected that the successful derivations obtained with 
the help of M might not contain enough “typical wrong decisions” from the 


12 Minimizing the standard cross entropy loss should actually automatically “bring the 
curve” close to that corner for the threshold t = 0. 

Table 4. The performance of new models learned from guided proofs. U is the set of
1528 problems used for the training. The gained and lost counts are here w.r.t. U.

                       #solved   (percent S)   (percent |U|)   gained   lost
plain                    1268       167%           82%            90     350
with negative mining     1394       184%           91%           140     274


perspective of S to provide for good enough training. We therefore logged the 
failing runs of S on the (1528 − 734) problems only solved by one of the guided
strategies and augmented the corresponding derivations with these.¹³

Table 4 confirms¹⁴ that negative mining indeed helps to produce a better
model. Mainly, however, it shows that training from additional derivations fur- 
ther dramatically improves the performance of the obtained strategy. 


6 Conclusion 


We revisited the topic of ENIGMA-style clause selection guidance by a machine 
learned binary classifier and proposed four improvements to previous work: (1) 
the use of layered clause selection for integrating the advice, (2) the lazy evalu- 
ation trick to reduce the overhead of interfacing a potentially expensive model, 
(3) the “positive bias” idea suggesting to be really careful not to discard poten- 
tially useful clauses, and (4) the “negative mining” technique to provide enough 
negative examples when learning from proofs obtained with previous guidance. 

We have also shown that a strong advice can be obtained by looking just 
at the derivation history to discriminate a clause. The automatically discovered 
neural guidance significantly improves upon the human-engineered heuristic [11] 
under identical conditions. Rerunning S with the theory heuristic enabled in its 
default form [10] resulted here in 816 (107%) solved problems. 

By deliberately focusing on the representation of clauses by their derivations,
we obtained some nice properties, such as relative speed of evaluation. However, 
in situations where theory reasoning by automatically added theory axioms is 
not prevalent, such as on most of the TPTP library, we expect guidance based 
on derivations with just a single axiom origin label, the input, to be quite weak. 

Still, we see a great opportunity in using statistical methods for analyzing 
ATP behaviour; not only for improving prover performance with a black box 
guidance, but also as a tool for discovering regularities that could be exploited 
to improve our understanding of the technology on a deeper level. 

13 Negative mining has, for instance, been previously used when training deep models
for the premise selection task [14].
14 The ATP evaluation again integrated the model via S ⊕ S[M^{1,0}] under the second
level ratio 1:2.

Abstract. We introduce a modular and transparent approach for augmenting
the ability of reinforcement learning agents to comply with a
given norm base. The normative supervisor module functions as both 
an event recorder and real-time compliance checker w.r.t. an external 
norm base. We have implemented this module with a theorem prover for 
defeasible deontic logic, in a reinforcement learning agent that we task 
with playing a “vegan” version of the arcade game Pac-Man. 
1 Introduction 


Autonomous agents are an increasingly integral part of modern life. While 
performing activities formerly reserved for human agents, they must possess 
the ability to adapt to (potentially unpredictable) changes in their environment; 
reinforcement learning (RL) has proven a successful method for teaching agents 
this behaviour (see, e.g. [16,13]). Performing human roles further requires that 
agents align themselves with the ethical standards their human counterparts are 
subject to, introducing a requirement for ethical reasoning. RL has been employed 
to enforce such standards as well (see, e.g., [14]); agents can be trained to act 
in line with further rewards/penalties assigned according to the performance 
of ethical/unethical behaviour through a reward function. However, this does 
not provide a guarantee of the desired behaviour. Moreover, such techniques are 
not well equipped to handle the complexities of ethical reasoning. In general, 
like other black-box machine learning methods, RL cannot transparently explain 
why a certain policy is compliant or not. Additionally, when the ethical values 
are embedded in the learning process, a small change in their definition would 
require us to retrain the policy from scratch. 

To obviate the limitations of RL to represent ethical norms, the approach 
we follow in this paper combines RL with Deontic Logic, the branch of formal 
logic that is concerned with prescriptive statements; we implement a normative 
supervisor to inform a trained RL agent of the ethical requirements in force in a 
given situation. Since the pioneering works [17,15], it has been well understood 
that Deontic Logic can be applied to model ethical norms; the difference between 
ethical and legal norms is indeed only on how they emerge, not what normative 
consequences are entailed by them. We implement our normative supervisor using
defeasible deontic logic [8,9]. This is a simple and computationally feasible, yet
expressive, logic allowing for defeasible reasoning, and can easily accommodate 
changes to the norm base, should the ethical requirements become more complex 
(see Sect. 3.4 for a brief walk-through). Moreover, the constructive nature of this 
logic allows us to determine how a given conclusion has been reached. 

By embedding the normative supervisor into the RL agent architecture, the
agent can follow near-optimal learned policies while enforcing ethical actions
in a modular and transparent way. The supervisor functions as both an event 
recorder and real-time compliance checker; it corrects the choice of a given action 
from the policy only when this violates a norm. It is furthermore used as an 
event logger to identify and extract new sets of (ethical) norms to promote 
particular goals. We demonstrate our approach on an RL agent that plays a 
“vegan” version of Pac-Man, with an “ethical” constraint forbidding Pac-Man 
from eating ghosts. Already used as a case study in [14,10], the Pac-Man game 
is a closed environment for testing with clearly defined game mechanics and 
parameters which are easy to isolate, manipulate, and extend with variably 
intricate rule sets. We successfully evaluated our approach with several tests, 
consisting of “vegan” games and a “vegetarian” version of the game where the 
agent can eat only one type of ghost. The achievement of full compliance in the 
latter case was possible with the introduction of additional norms identified via 
the event recorder. 
Related Work. The papers [14] and [10] on Pac-Man motivated our work. The 
former employs multi-objective RL with policy orchestration to impose normative 
constraints on vegan Pac-Man. It seamlessly combines ethically compliant be- 
haviour and learned optimal behaviour; however, the ethical reasoning performed 
is still to a degree implicit, it does not provide justifications for the choices 
made, and it is not clear how the approach would remain reasonably transpar- 
ent with more complex norm sets. [10] takes steps to integrate more complex 
constraints on an RL agent, but as they are embedded in the learned policy, it
lacks the transparency of a logic-based implementation. [1] and [2] address the 
problem of transparency in the implementation of ethical behaviours in AI, but 
their approach has not been implemented and tested yet. Symbolic reasoning 
for implementing ethically compliant behaviour in autonomous agents has been 
used in many frameworks, such as [5], which models the behaviour from a BDI 
perspective. This approach does not allow for defeasible reasoning, and focuses on 
avoiding ethical non-compliance at the planning level. Non-monotonic logic-based 
approaches that extend BDI with a normative component appear in [6,9], whose 
solutions remain only at the theoretical level. These papers belong to the related 
field of Normative Multi-Agent Systems, which is not specifically concerned with 
the ethical behaviour of agents [3], and whose introduced formalisms and tools 
(e.g. [12]) have not yet been used in combination with RL. 


2 Background 

Normative Reasoning. Normative reasoning differs from the reasoning cap- 
tured by classical logic in that the focus is not on true or false statements, but 
rather the imposition of norms onto such statements. 
We will deal with two types of norms: constitutive and regulative norms
(see [4] for the terminology). Regulative norms describe obligations, prohibitions 
and permissions. Constitutive norms regulate instead the creation of institutional 
facts as well as the modification of the normative system itself; their content is a 
relation between two concepts, and they will typically take the form “in context 
c, concept x counts as concept y”, where x refers to a more concrete concept (e.g., 
walking) and y to a more abstract one (e.g., moving). We say concept x is at a 
lower level of abstraction than concept y in context c if there is a constitutive 
norm with context c asserting that x counts as y (henceforth denoted C(x, y)). 


Reinforcement Learning (RL). RL refers to a class of algorithms specialized 
in learning how an agent should act in its environment to maximize the expected 
cumulative reward. Given a function that assigns rewards/penalties to each state 
and successor state pair (or state-action pairs), the RL algorithm learns an 
optimal policy, a function from states to actions that can govern its behaviour. 

In our case study we chose Q-learning [18] with function approximation as
the RL algorithm. In Q-learning, the RL algorithm first learns a function Q(s,a)
to predict the expected cumulative reward (Q-value) from state s taking action
a. The learned policy picks the action argmax_{a ∈ possible} Q(s,a) with the highest
Q-value over a list of possible actions. The function Q is approximated as a linear
function which is the weighted sum of features describing some elements of the
environment (e.g., the distance between the agent and object X); the features
which are most relevant to predicting the agent's success are weighted most heavily.
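A linear Q-function and the greedy policy over it can be sketched as follows. The feature names, the feature extractor passed in as `extract`, and the step sizes are placeholders; the weight update shown is the textbook rule for approximate Q-learning and is included only for orientation, not as the exact procedure used to train the agent.

```python
def q_value(weights, features):
    """Q(s,a) as the weighted sum of the features describing (s,a)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def greedy_action(weights, state, possible, extract):
    """Pick argmax over the possible actions; extract(state, a) gives features."""
    return max(possible, key=lambda a: q_value(weights, extract(state, a)))

def td_update(weights, features, reward, best_next_q, alpha=0.1, gamma=0.9):
    """One approximate Q-learning step: move weights along the TD error."""
    error = reward + gamma * best_next_q - q_value(weights, features)
    updated = dict(weights)
    for name, value in features.items():
        updated[name] = updated.get(name, 0.0) + alpha * error * value
    return updated
```

Features that consistently co-occur with a large temporal-difference error receive the largest weight changes, which is how the most predictive features end up weighted most heavily.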


Vegan Pac-Man. In the arcade game Pac-Man, an eponymous agent is situated 
inside a maze over a grid, where some cells contain a ‘food pellet’ which Pac-Man 
will eat if it moves inside the cell. Pac-Man’s goal is to maximize his score; when 
Pac-Man eats a food pellet he gains a reward (+10 points), but there is also a 
time penalty (-1 point for every time step). Pac-Man wins when he has eaten 
all the food pellets in the maze (resulting in +500 points), and he loses if he 
collides with one of the ghost agents wandering around the maze (resulting in 
-500 points). However, after eating a ‘power pellet’ (of which there are two), the 
ghosts become ‘scared’, and Pac-Man can eat them (for +200 points). 

Inspired by [14], we consider a variation of the UC Berkeley AI Pac-Man 
implementation [7], where Pac-Man cannot eat ghosts (only blue ghosts in the 
vegetarian version). Our Pac-Man agent utilizes a Q-learning policy; for the 
utility function we use the game’s score, and we take the game states as states. 
We use the same game layout as in [14]; this is a 20 x 11 maze populated with 97 
food pellets and two ghosts (blue and orange) which follow random paths, where 
the maximum score available is 2170, and 1370 when eating ghosts is forbidden. 
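The two maximum scores can be related to the scoring rules by a short calculation. The sketch assumes that power pellets themselves award no points, that both ghosts are eaten after each of the two power pellets (four ghosts in total), and that a best-case game takes 100 time steps; the last two values are inferred from the reported maxima rather than stated in the text.

```python
def final_score(pellets, ghosts_eaten, steps, won=True):
    """Score under the rules above: +10 per food pellet, +200 per ghost eaten,
    -1 per time step, and +500 for winning or -500 for losing."""
    outcome = 500 if won else -500
    return 10 * pellets + 200 * ghosts_eaten - steps + outcome
```

Under these assumptions, `final_score(97, 0, 100)` gives 1370 and `final_score(97, 4, 100)` gives 2170, so the difference of 800 points between the two maxima corresponds to four eaten ghosts.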


3 The Normative Supervisor 


The key component of our approach is a normative supervisor whose architecture 
is illustrated in Fig. 1. This module consists of a normative reasoning engine (we 
use the SPINdle theorem prover [11]), and of other components that encode the 
norms and environmental data into defeasible deontic logic rules, and translate
the conclusions of the reasoning engine into instructions for the agent.

Fig. 1. Key components and placement of the Normative Supervisor.

We place the normative supervisor in the already-trained agent’s control loop
between the localization and policy modules. The localization module identifies
the agent’s current state with respect to its environment and returns a list of
possible actions to the normative supervisor. The supervisor filters out all the
actions that are not compliant with the norms. The policy will then identify,
among the pool of the compliant actions, the optimal one for generating the next 
game state. If there are no available compliant actions the normative supervisor 
will select the ‘lesser evil’ action. This module additionally enables the logging of 
events during the game for later scrutiny. 
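One step of this control loop can be sketched as below. The callables `is_compliant`, `q_value`, and `lesser_evil` are hypothetical stand-ins for the reasoner query, the learned policy, and the fallback selection, respectively; none of these names comes from the implementation.

```python
def supervised_choice(state, possible, is_compliant, q_value, lesser_evil, log=None):
    """Choose the next action with the normative supervisor in the loop."""
    # The supervisor keeps only the actions that comply with the norm base.
    compliant = [a for a in possible if is_compliant(state, a)]
    # Optional event log for later scrutiny.
    if log is not None:
        log.append((state, tuple(possible), tuple(compliant)))
    # No compliant action is available: fall back to the 'lesser evil'.
    if not compliant:
        return lesser_evil(state, possible)
    # Otherwise the policy picks the best action among the compliant ones.
    return max(compliant, key=lambda a: q_value(state, a))
```

Because the filter sits in front of an unchanged policy, a change to the norm base only changes `is_compliant` and does not require retraining.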


3.1 Configuring the Norm Base 


We start with a simple normative prescription, consisting only of the behavioral 
constraint proposed in [14] that “Pac-Man must not eat ghosts”¹, represented as
vegan: F(eat(pacman, ghost)), where F denotes prohibition.

If this norm base is to inform our agent’s actions, it needs to reference concepts 
that correspond to the information directly processed by the agent, which is 
limited to the locations of game entities and the actions that Pac-Man can 
perform, which we denote as North, South, East, West, and Stop. The only 
way eat(pacman, ghost) can be done is if (a) the ghost is in a ‘scared’ state, 
and (b) Pac-Man and the ghost move into the same cell. These are expressed 
as scared(ghost) and inRange(pacman, ghost) respectively. Pac-Man does not 
know which direction the ghost will move in, but we will assume a “cautious” 
model of action where Pac-Man is not to perform any action that could constitute 
eating a ghost; that is, if Pac-Man takes an action that could reasonably lead 
to him violating a norm, we will consider that norm violated. Since Pac-Man’s 
next action determines what is in range, we will actually need five entities 
to express inRange(pacman, ghost), one corresponding to each action. These 
concepts are used to construct a constitutive norm, or a kind of strategy, regarding 
eating, strategy_North : C(North, eat(pacman, ghost)), which is applicable in
the context {scared(ghost), inNorthRange(pacman, ghost)}.


1 For the time being we generalize the blue and the orange ghosts as ghost.
For inNorthRange(pacman, ghost), we have access to the positions of Pac-Man
and the ghosts, so we can create another set of constitutive norms for
this, which apply in the context {pacman(i,j)}: range_North : C(ghost(k,l),
inNorthRange(pacman, ghost)), where (k,l) has a Manhattan distance of at most
one cell from (i,j + 1).
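The range condition amounts to a Manhattan-distance test against the cell Pac-Man would enter. In the sketch below only the North case, with target cell (i, j + 1), is taken from the text; the offsets for the other four actions are our assumption by analogy, and the function names are hypothetical.

```python
# Target cell offsets per action; only North is given in the text above.
OFFSETS = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0), "Stop": (0, 0)}

def in_range(action, pacman, ghost):
    """True if the ghost is at most one cell (Manhattan) from the target cell."""
    dx, dy = OFFSETS[action]
    target_i, target_j = pacman[0] + dx, pacman[1] + dy
    return abs(ghost[0] - target_i) + abs(ghost[1] - target_j) <= 1

def actions_counting_as_eating(pacman, ghost, scared):
    """Actions that, under the cautious model, count as eating the ghost."""
    if not scared:
        return []
    return [a for a in OFFSETS if in_range(a, pacman, ghost)]
```

With Pac-Man at (2,3) and a scared ghost at (2,4), moving North is in range, and so is stopping, since the ghost could step into Pac-Man's cell.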

Finally, we need to consider additional relationships between norms and 
concepts. For this norm base, we only have one regulative norm, so a mechanism 
for conflict resolution is not needed. However, as Pac-Man can only execute one 
action at a time, we have a non-concurrence relation between every action. This 
amounts to an inability to comply with multiple obligations over distinct actions. 
However, since Vegan Pac-Man does not deal with any obligations, additional 
rules will not be needed. 


Representing the Norm Base. We need a formal language — equipped with 
an automated theorem prover — capable of effectively representing and reasoning 
with the norm base; we chose defeasible deontic (propositional) logic (DDPL 
for short) [8]. DDPL is defined over literals and modal literals, and the key 
ingredient is the rules we can construct from them. For the purposes of this paper 
we only consider one deontic modality (obligation O) and define prohibition and
permission as F(p) = O(¬p) and P(p) = ¬O(¬p).


Definition 1. A rule is an expression r: A(r) ↪_* N(r) where r is a label uniquely
identifying the rule, A(r) = {a_1, ..., a_n} is the antecedent, N(r) is the consequent,
↪_* ∈ {→_*, ⇒_*, ⇝_*}, and the mode of each rule is designated with * ∈ {C, O}.


Rules labelled by C and O are constitutive and regulative rules, respectively. Strict
rules (→_*) are rules where the consequent strictly follows from the antecedent
without exception. Defeasible rules (⇒_*) are rules where the consequent typically
follows from the antecedent, unless there is evidence to the contrary. Defeaters
(⇝_*) are rules that only prevent a conclusion from being reached by a defeasible
rule; regulative defeaters are used to encode permissive rules (see [8]).

The central concept of DDPL (and our application of it) is: 


Definition 2. A defeasible theory D is a tuple (F, R_O, R_C, >), where F is a set
of literals (facts), R_O and R_C are sets of regulative and constitutive rules, and >
is a superiority relation over rules.


These tools will be utilized to map Pac-Man’s game state to a defeasible theory:
the environment is translated to a set of facts and the norm base to a set of rules.
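The two definitions translate naturally into a small data structure. The sketch below is an illustrative encoding only: the string representation of literals (with "~" for negation), the class and field names, and the helper functions are our assumptions and do not reflect the input format of any particular prover.

```python
from dataclasses import dataclass, field

def neg(literal):
    """Negate a literal written as a string; '~' marks negation."""
    return literal[1:] if literal.startswith("~") else "~" + literal

def obligation(literal):
    return f"O({literal})"

def prohibition(literal):
    return obligation(neg(literal))        # F(p) = O(not p)

def permission(literal):
    return neg(obligation(neg(literal)))   # P(p) = not O(not p)

@dataclass(frozen=True)
class Rule:
    label: str
    antecedent: frozenset
    consequent: str
    kind: str   # "strict", "defeasible" or "defeater"
    mode: str   # "C" (constitutive) or "O" (regulative)

@dataclass
class DefeasibleTheory:
    facts: set = field(default_factory=set)
    regulative: list = field(default_factory=list)     # rules of mode O
    constitutive: list = field(default_factory=list)   # rules of mode C
    superiority: set = field(default_factory=set)      # (stronger, weaker) label pairs

    def add(self, rule):
        target = self.regulative if rule.mode == "O" else self.constitutive
        target.append(rule)
```

A theory for one game state would then hold the translated facts together with the rules obtained from the norm base.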


3.2 Automating Translation 


We are now dealing with three kinds of syntax: our informal representation of the 
norm base, the input and output of the host process, and the formal language 
of the reasoner (DDPL and its theorem prover SPINdle [11]). If we frame the 
reasoner as a central reasoning facility, the agent as a front-end, and the norm 
base as a back-end, we can implement this dynamic as a translator with two 
faces, one front-facing and one back-facing, feeding information into the reasoner 
from the agent and the norm base respectively. 
Front End Translation. The front-end translator will be continuously in use,
sending new data to be translated and requiring translated proposed actions 
as the environment changes. This will be an algorithm that transforms input 
data from the agent into propositions which assert facts about the agent or 
the environment, and then logical conclusions into instructions the agent will 
understand. Each cell of the Pac-Man grid can contain characters (Pac-Man or 
one of the ghosts), an object (a wall or a food pellet), or nothing at all. Walls 
are accounted for during the localization stage of Pac-Man’s algorithm and food 
pellets are not an entity that appears in the norm base, so we will need to reason 
only about the characters. Hence we have two sets of variables in each game; 
pacman_{i,j} and ghost_{i,j} (along with scared(ghost) if the ghost is in a scared
state) assert the current coordinates of Pac-Man and of each ghost, and appear
in a set Facts in the defeasible theory GameState = (Facts, R_O, R_C, >).
Actions will be represented as deontic literals, in the set 


Actions = { North, South, East, West, Stop} 


A query from Pac-Man to the reasoner will be accompanied by a representation
of the current game state, along with a list of possible actions, possible, each of
which will be translated to the corresponding literal in Actions.
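A front-end translation along these lines could look as follows. The fact names, the input layout (a position pair for Pac-Man and a dictionary of ghosts), and the function names are hypothetical; in particular, the sketch merges all ghosts into a single generic ghost, as the text does for the time being.

```python
ACTIONS = ("North", "South", "East", "West", "Stop")

def translate_state(pacman, ghosts):
    """Turn positions into a set of facts.

    `pacman` is (i, j); `ghosts` maps a ghost name to ((k, l), scared).
    Ghost identities are dropped, so one scared ghost marks the generic
    ghost as scared.
    """
    facts = {f"pacman_{pacman[0]}_{pacman[1]}"}
    for (k, l), scared in ghosts.values():
        facts.add(f"ghost_{k}_{l}")
        if scared:
            facts.add("scared_ghost")
    return facts

def translate_actions(possible):
    """Map the agent's possible actions to literals in Actions."""
    unknown = [a for a in possible if a not in ACTIONS]
    if unknown:
        raise ValueError(f"unknown actions: {unknown}")
    return [a for a in ACTIONS if a in possible]
```

The resulting facts and action literals are what a query to the reasoner would carry for the current game state.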


Back End Translation. In this task it is crucial to ensure that norms
dictate the same behaviour once translated into the language of the reasoner.
Besides making
sure that each component of the norm can be represented by the language, we 
must also analyse our translated norm base with respect to how the available 
metadata is accommodated by the reasoner’s rules of inference. 

We represent the regulative norm of Vegan Pac-Man (vegan) as: 


vegan: ⇒_O ¬eat_{pacman,ghost} ∈ R_O


where defeasibility is given as a precautionary measure, in case we want to add 
(potentially conflicting) norms later. 

Note that if moving North counts as eating a ghost, an obligation to go 
North counts as being obligated to eat a ghost, and a prohibition to eat a 
ghost implies a prohibition to move North. So we can rewrite strategy_North as
C(O(¬eat(pacman, ghost)), O(¬North)), or with the applicable context as:


scared_ghost, inNorthRange_{pacman,ghost}, O(¬eat_{pacman,ghost}) ⇒_O ¬North ∈ R_O


Note that though this is a constitutive norm, in DDPL it will be in R_O. This will
work for all of the constitutive norms attached to a prohibited action, where 
we place the context and the prohibition in question in the antecedent, and the 
prohibition of the concrete action in the strategy is the consequent. 

For the remaining constitutive norms, we have a rather simple conversion. 
These norms will be generated w.r.t. the input from the agent; for example, if 
the agent (Pac-Man) tells us that he is at (2,3), the rule $range_{North}$ will be: 


$pacman_{2,3},\ ghost_{2,4} \Rightarrow_C inNorthRange_{pacman,ghost} \in R_C$


We have found that it is more time-efficient to generate these constitutive 
norms anew whenever the fact set changes, instead of generating every possible 
constitutive norm ahead of time, and having SPINdle deal with all at once. 

3.3 Classify and Assess Conclusions 


Once we understand how various concepts are represented in the reasoner lan- 
guage, we need to parse the possible outputs of the reasoning engine into indicators 
as to which actions in the agent’s arsenal are compliant with the norm base. 


Compliant Solutions. Ideally, we will want to locate a compliant solution — an 
action that constitutes a possible course of action for the agent that does not 
violate any norms — from the conclusions yielded by the reasoner. 


Definition 3. A set of compliant solutions is: (1) non-empty, and consisting 
only of (2) solutions composed of possible actions, (3) solutions that do not violate 
any norms, and (4) solutions that are internally consistent. 


The manner in which we construct such a set is heavily influenced by the 
output (conclusions) yielded by SPINdle. Conclusions in DDPL are established 
over proofs and can be classified as defeasible or definite, and positive or negative. 
A positive conclusion means that the referenced literal holds, while a negative 
one indicates that this literal has been refuted. A definite conclusion is obtained by 
forward chaining using only strict rules and facts. A conclusion holds 
defeasibly (denoted by $+\partial_C$ for a factual conclusion and $+\partial_O$ for an obligation) if 
there is an applicable rule for it and the rules for the opposite cannot be applied 
or are defeated. Over the course of a proof, each rule will be classified as either 
applicable (i.e., the antecedent holds and the consequent follows), discarded (i.e., 
the rule is not applied because the antecedent doesn’t fully hold), or defeated 
by a defeater or a higher priority rule. For a set of rules $R$, $R[p]$, $R_O$ and $R^{sd}$ 
are, respectively, the subsets of: the rules for $p$, regulative rules, and strict or 
defeasible rules. The definition of provability for defeasible obligations [8] (we 
define only defeasible conclusions, because in our formalization regulative norms 
were expressed as defeasible rules) is: 


Definition 4. Given a defeasible theory $D$, if $D \vdash +\partial_O\, p$, then: 
1. $\exists r \in R^{sd}_O[p]$ that is applicable defeasible, and 
2. $\forall s \in R_O[\neg p]$ either: (a) $s$ is discarded, or (b) $s \in R^{sd}$ and $\exists t \in R_O[p]$ s.t. $t$ is 
applicable, $t > s$, or (c) $s$ is a defeater, $\exists t \in R^{sd}_O[p]$ s.t. $t$ is applicable, $t > s$ 


A derivation in DDPL has a three phase argumentation structure, where argu- 
ments are simply applicable rules: (1) we need an argument for the conclusion we 
want to prove, (2) we analyse all possible counter-arguments, and (3) we rebut 
the counter-arguments. An argument can be rebutted when it is not applicable or 
when it is defeated by a stronger applicable argument. If we exclude the undercut 
case, in every phase the arguments attack the arguments in the previous phase. 
A rule attacks another rule if the conclusions of the two rules are contradictory 
(note that P(q) and P(~q) are not a deontic contradiction). Accordingly, any 
regulative rule for q attacks a strict or defeasible regulative rule for ~q. However, a 
regulative defeater for q is not attacked by a regulative defeater for ~q (condition 
2(c) above). 

We parse out a solution set by: (1) if we do not receive a full set of conclusions 
from SPINdle, we return an empty set; (2) we remove all conclusions that do 
not reference a literal in possible; (3) any action corresponding to a defeasibly 
proved positive literal occurs in every solution; and (4) any action corresponding 
to a defeasibly proved negative literal is discarded from every solution. 

Claim: the above procedure yields either an empty set or a compliant solution. 
Proof sketch: If our solution is not internally consistent, we can prove both $+\partial_O\, a$ 
and $+\partial_O\, \neg a$ for some action $a$. In this case SPINdle will return neither, and the 
above procedure leads to an empty set in step (1). Only possible actions will 
occur in a solution as per step (2), and any solutions which fail to comply with an 
obligation or prohibition will be excluded through steps (3) and (4) respectively. 
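
The parsing steps above can be sketched in Python. This is only an illustration under stated assumptions: the dictionary format for the conclusions, the name compliant_actions, and the rule that the agent performs exactly one action per step are ours, and none of them is taken from SPINdle or from the implementation described here.

```python
from typing import Dict, List, Optional, Set


def compliant_actions(conclusions: Optional[Dict[str, bool]],
                      possible: List[str]) -> Set[str]:
    """Return the actions that comply with the norm base.

    Hypothetical input format: `conclusions` maps an action name to True if
    an obligation to perform it was defeasibly proved, and to False if a
    prohibition was proved. None stands for "no full set of conclusions".
    """
    # Step (1): without a full set of conclusions there is no solution.
    if conclusions is None:
        return set()
    # Step (2): drop conclusions that do not reference a possible action.
    relevant = {a: v for a, v in conclusions.items() if a in possible}
    # Step (3): obligated actions must occur in every solution.
    obligated = {a for a, v in relevant.items() if v}
    # Step (4): prohibited actions are discarded from every solution.
    forbidden = {a for a, v in relevant.items() if not v}
    # Assumption: one action per step, so two obligations cannot both be met.
    if len(obligated) > 1:
        return set()
    if obligated:
        return obligated
    return set(possible) - forbidden


if __name__ == "__main__":
    print(compliant_actions({"North": False}, ["North", "South", "Stop"]))
```

An empty result would then trigger the fallback to minimally non-compliant actions.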


‘Lesser of two Evils’ Solutions. If the above procedure leaves us with an 
empty solution set, we want to identify which non-compliant actions constitute 
the “best” choice (i.e. are minimally non-compliant). Our characterization of 
degrees of non-compliance depends on the way the reasoner constructs solutions, 
and what information it logs during this process. SPINdle has an inference logger 
that classifies every rule in the theory as discarded, applicable, or defeated. For 
our agent, the chosen degree is a score derived from the number of norms that have been 
applied versus those that have been defeated (discarded norms are ignored): 

$score := \#complied - \#violated = \#applied - \#defeated$

This score is computed through the theory $GameState_a$, which is constructed 
by adding a fact O(a) to GameState. Recall that a rule will be defeated when 
its defeasible theory includes a fact that conflicts with the head of this rule. 
So when we add O(a) to GameState, all norms that prescribed F(a) = O(~a) 
for GameState are defeated and any prescribing O(a) is applied. To compute 
the score, we use SPINdle in a rather unconventional way, ignoring conclusions 
yielded and checking the inference log to count which rules have been applied 
during reasoning ($\#applied$) and which were defeated ($\#defeated$) and set 
$score = \#applied - \#defeated$. This procedure is completed for every action in 
possible, and we select the action(s) with the highest score. If there are multiple 
actions with a highest score, we send multiple solutions to the agent and it will 
pick the best action according to its policy. 
Claim: computing scores for all possible actions is completed in polynomial time. 
Proof sketch: As shown in [8], conclusions in DDPL can be computed in linear 
time with respect to the number of literal occurrences plus the number of 
rules in the theory. The claim holds since every action in possible is a literal, 
and the above procedure is completed |possible| times. 
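
The scoring step can be illustrated with a short Python sketch. We assume a hypothetical log format in which each candidate action maps to a dictionary from rule labels to one of the statuses "applicable", "defeated", or "discarded"; SPINdle's actual inference logger output is not shown in this paper, and the rule labels below are invented for the example.

```python
from typing import Dict, List


def norm_score(rule_status: Dict[str, str]) -> int:
    """score = #applied - #defeated; discarded rules are ignored."""
    applied = sum(1 for s in rule_status.values() if s == "applicable")
    defeated = sum(1 for s in rule_status.values() if s == "defeated")
    return applied - defeated


def best_actions(logs: Dict[str, Dict[str, str]]) -> List[str]:
    """Return every action whose score is maximal (ties are all kept)."""
    if not logs:
        return []
    scores = {action: norm_score(log) for action, log in logs.items()}
    top = max(scores.values())
    return [action for action, s in scores.items() if s == top]


if __name__ == "__main__":
    # Hypothetical logs, one per theory GameState_a.
    logs = {
        "North": {"vegan_blue": "defeated", "vegan_orange": "applicable"},
        "South": {"vegan_blue": "applicable", "vegan_orange": "defeated"},
        "Stop": {"vegan_blue": "defeated", "vegan_orange": "defeated"},
    }
    print(best_actions(logs))
```

When several actions tie, all of them are returned, which matches the behaviour described above where the agent picks among them according to its policy.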


3.4 Revising the Norm Base 


We demonstrate the advantages of our approach — modularity, configurability, 
and capability as an event recorder — through revising our norm base. 

Inherent to Pac-Man’s environment is the possibility of encountering a state 
where no compliant action is possible; in this section we explore how to address 
cases like this through adding or removing rules to the norm base. 

When playing “vegan” Pac-Man, we may encounter the case depicted in 
Fig. 2(a). In absence of additional information Pac-Man will eat whichever ghost 


Fig. 2. Pac-Man trapped between two ghosts (a) or in a corner (b). In (c) Pac-Man 
consumes the power pellet and eats the ghost at the same time. 


the policy indicates it should, and a violation report is generated. Each violation 
report is saved as a timestamped file accompanied with the representation of 
the current game state. This report can be used to retroactively examine the 
context in which violations occur, and we can thereby revise our norm base 
which is independent from the agent’s RL policy. In the case of “vegan” Pac-Man, 
these reports make it clear that this version of the game will be susceptible to 
somewhat regular violations in the form of Fig. 2(a). 

If we consider instead “vegetarian” Pac-Man, we can restrict our norm base 
to the vegan rule only applied to the blue ghost. However, situations in which 
compliance is not possible can still occur; for instance the one depicted in Fig. 2(b), 
or the case where Pac-Man consumes a power pellet and the blue ghost at the 
same time, as shown in Fig. 2(c). In the latter case, the violation occurs because, 
prior to Pac-Man’s consumption of the power pellet, the blue ghost is not scared 
and Pac-Man’s strategy to comply with vegan will not be triggered. This is 
roughly analogous to an agent committing an unethical act because it has no 
way of recognizing that it is unethical. 

Summarily, the violation reports show that there are four points in the maze 
where Pac-Man, potentially, cannot comply, given the information he has access 
to; in response, we add a norm danger steering Pac-Man away from these areas: 


$\Rightarrow_O \neg enter_{pacman,danger}$


which is accompanied by constitutive norms defining the abstract action of 
“entering danger” (for some pre-defined location denoted as danger), such as: 


$inNorthRange_{pacman,danger},\ inRange_{ghost,danger} \Rightarrow_O \neg North$


4 Evaluation and Conclusion 


We have presented a modular and transparent approach that enables an autonomous 
agent to pursue ethical goals, while still running an RL policy that 
maximizes its cumulative reward. Our approach was evaluated on six tests³, 
in batches of 100 games. The results are displayed in the following table and 
discussed below; we give data on both game performance (average score and 
% games won) and ethical performance (ghosts eaten). Refer to Sec. 2 for a 
thorough description of the testing environment. 

The first two baseline tests measured the performance of Pac-Man using two 
different (ethically agnostic) RL policies without the normative supervisor; this 
establishes a baseline for Pac-Man’s game performance. We refer to the first 


3 We use a laptop with Intel i5-8250U CPU (4 cores, 1.60 GHz) and 8GB RAM, running 
Ubuntu 18.04, Java 8, Python 2.7. 

Test                  Won    Score (Avg [Max])   Avg ghosts eaten 
RL policies without Normative Supervisor (Baseline tests) 
1a - Safe             88 %   1189.4 [1526]       0.02 (blue) / 0.03 (orange) 
1b - Hungry           87 %   1503.5 [2133]       0.89 (blue) / 0.81 (orange) 
RL policies with Normative Supervisor 
2a - Safe Vegan       89 %   1193.39 [1526]      0.01 (blue) / 0.02 (orange) 
2b - Hungry Vegan     92 %   1211.67 [1350]      0.00 (blue) / 0.00 (orange) 
3 - Vegetarian        94 %   1413.8 [1742]       0.01 (blue) / 0.79 (orange) 
4 - Safe Vegetarian   87 %   1336.2 [1747]       0.00 (blue) / 0.88 (orange) 


RL policy (in Test 1a) as safe because the algorithm used to train it does not 
differentiate between regular ghosts and scared ghosts, learning how to avoid 
them altogether. We refer to the other RL policy (in Test 1b) as hungry because 
the corresponding algorithm differentiates between regular ghosts and scared 
ghosts, and the agent learns how to eat the scared ghosts. The results for Test 1b 
(average score of 1503.5, maximum score of 2133) were comparable to the baseline 
version in [14] (average score of 1675.9, max score of 2144). 


Tests 2a, 2b, 3, and 4 make use of the normative supervisor. In 2a and 2b, we 
subject Pac-Man to a “vegan” norm base, prohibiting eating all ghosts (for both 
the safe and hungry policies respectively). The results obtained for test 2a were 
comparable to those in [14]: the average number of violations was the same in 
both tests (0.03 ghosts), and our average score was only slightly smaller (1193.39 
instead of 1268.5). Compared with the baseline, the game performance did not 
suffer. For test 2b we instead obtained full compliance. Tests 3 and 4 both use the 
hungry policy. In test 3 we subject Pac-Man to a “vegetarian” norm base, where 
only eating blue ghosts is forbidden. Allowing Pac-Man to eat one of the ghosts 
allows him to further maximize his score and avoid the violations depicted in 
Fig. 2(a). Test 4 addresses the two edge cases of non-compliance occurring in 
Test 3 as depicted in Fig. 2(b) and Fig. 2(c) by adding the new rules defined 
in Sec. 3.4, steering Pac-Man away from entering the “dangerous” areas. Here, 
violations were completely eliminated. 


These tests, along with the analysis of the violation reports created in non- 
compliant cases, yielded several insights. The module did not cause Pac-Man’s 
game performance to suffer, and could successfully identify non-compliant be- 
haviour. It implemented compliant behaviour in most cases, with the exception 
of situations where compliance was not possible. The violation reports allowed us 
to identify such situations with ease. 


The game used in this paper offers limited opportunities to work with mean- 
ingful (ethical) norms. We aim to explore alternative case studies with more 
options to define multiple (and possibly conflicting) ethical goals to test the 
interactions between RL and a normative supervisor based on DDPL. 

Abstract. We present a method for automatically building diagrams 
for olympiad-level geometry problems and implement our approach in 
a new open-source software tool, the Geometry Model Builder (GMB). 
Central to our method is a new domain-specific language, the Geometry 
Model-Building Language (GMBL), for specifying geometry problems 
along with additional metadata useful for building diagrams. A GMBL 
program specifies (1) how to parameterize geometric objects (or sets 
of geometric objects) and initialize these parameterized quantities, (2) 
which quantities to compute directly from other quantities, and (3) ad- 
ditional constraints to accumulate into a (differentiable) loss function. 
A GMBL program induces a (usually) tractable numerical optimization 
problem whose solutions correspond to diagrams of the original prob- 
lem statement, and that we can solve reliably using gradient descent. 
Of the 39 geometry problems since 2000 appearing in the International 
Mathematical Olympiad, 36 can be expressed in our logic and our sys- 
tem can produce diagrams for 94% of them on average. To the best of 
our knowledge, our method is the first in automated geometry diagram 
construction to generate models for such complex problems. 


1 Introduction 


Automated theorem provers for Euclidean geometry often use numerical models 
(i.e. diagrams) for heuristic reasoning, e.g. for conjecturing subgoals, pruning 
branches, checking non-degeneracy conditions, and selecting auxiliary construc- 
tions. However, modern solvers rely on diagrams that are either supplied man- 
ually [7,24] or generated automatically via methods that are severely limited in 
scope [12]. Motivated by the IMO Grand Challenge, an ongoing effort to build 
an AI that can win a gold medal at the International Mathematical Olympiad 
(IMO), we present a method for expressing and solving olympiad-level systems 
of geometric constraints. 

Historically, algebraic methods are the most complete and performant for 
automated geometry diagram construction but suffer from degenerate solutions 


© The Author(s) 2021 
A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 577—588, 2021. 

(param (A B C) triangle) 
(define I point (incenter A B C)) 
(define Gamma circle (circumcircle A B C)) 

(define D point (inter-lc (line A I) Gamma (rs-neq A))) 
(param E point (on-circ Gamma)) 
(assert (same-side D E (line B C))) 

(param F point (on-seg B C)) 
(assert (= (uangle B A F) (uangle C A E))) 
(assert (lt (uangle C A E) (mul 0.5 (uangle B A C)))) 

(define G point (midp I F)) 
(eval (on-circ (inter-ll (line D G) (line E I)) Gamma)) 


Fig. 1: An example GMBL program and corresponding diagram generated by 
the GMB for IMO 2010 Problem 2. 


and, in the numerical case, non-convexity. These methods are restricted to rela- 
tively simple geometric configurations as poor local minima arise via large num- 
bers of parameters. Moreover, degenerate solutions manifest as poor distribu- 
tions for the vertices of geometric objects (e.g. a non-sensical triangle) as well 
as intersections of objects at more than one point (e.g. lines and circles, circles 
and circles). 

We constructed a domain-specific language (DSL), the Geometry Model- 
Building Language (GMBL), to express geometry problems whose semantics 
induce tractable numerical optimization problems. The GMBL includes a set 
of commands with which users introduce geometric objects and constraints be- 
tween these objects. There is a direct interpretation from these commands to the 
parameterization of geometric objects, the computation of geometric quantities 
from existing ones, and additional numerical constraints. The GMBL employs 
root selector declarations to disambiguate multiple solution problems, reparam- 
eterizations both to reduce the number of parameters and increase uniformity in 
model variance, and joint distributions for geometric objects that are susceptible 
to degeneracy (i.e. triangles and polygons). Our DSL treats points, lines, and 
circles as first-class citizens, and the language can be easily extended to support 
additional high-level features in terms of these primitives. 

We provide an implementation of our method, the Geometry Model Builder 
(GMB), that compiles GMBL programs into Tensorflow computation graphs [1] 
and generates models via off-the-shelf, gradient-based optimization. Figure 2 
demonstrates an overview of this implementation. Experimentally, we find that 
the GMBL sufficiently reduces the parameter space and mitigates degeneracy to 
make our target geometry amenable to numerical optimization. We tested our 
method on all IMO geometry problems since 2000 (n = 39), of which 36 can 
be expressed as GMBL programs. Using default parameters, the GMB finds a 
single model for 94% of these 36 problems in an average of 27.07 seconds. Of 
the problems for which our program found a model and the goal of the problem 
could be stated in our DSL, the goal held in the final model 86% of the time. 
All code is available on GitHub* with which users can write GMBL programs 
and generate diagrams. Our program can be run both as a command-line tool 
for integration with theorem provers or as a locally-hosted web server. 


2 Background 


Here we provide an overview of olympiad-level geometry problem statements, as 
well as several challenges presented by the associated constraint problems. 


2.1 Olympiad-Level Geometry Problem Statements 


IMO geometry problems are stated as a sequential introduction of potentially- 
constrained geometric objects, as well as additional constraints between entities. 
Such constraints can take one of two forms: (1) geometric constraints describe 
the relative position of geometric entities (e.g. two lines are parallel) while (2) 
dimensional constraints enforce specific numerical values (e.g. angle, radius). 
Lastly, problems end with a goal (or set of goals) typically in the form of geo- 
metric or dimensional constraints. The following is an example from IMO 2009: 


Let ABC be a triangle with circumcentre O. The points P and Q are 
interior points of the sides CA and AB, respectively. Let K, L, and M 
be the midpoints of the segments BP, CQ, and PQ, respectively, and 
let Γ be the circle passing through K, L, and M. Suppose that the line 
PQ is tangent to the circle Γ. Prove that OP = OQ. 
(IMO 2009 P2) 
This problem introduces ten named geometric objects and has a single goal. 
Note that this class of problems does not admit a mathematical description 
but rather is defined empirically (i.e. as those problems selected for olympiads). 
The overwhelming majority of these problems are of a particular type — plane 
geometry problems that can be expressed as problems in nonlinear real arithmetic 
(NRA). However, while NRA is technically decidable, olympiad problems 
tend to be littered with order constraints and complex constructions (e.g. mixtilinear 
incenter) and are well beyond the capability of existing algebraic methods. 
On the other hand, they are selected to admit elegant, human-comprehensible 
proofs. It is this class of problems that the GMBL was designed to express; 
though rare, any particular olympiad geometry problem is not guaranteed to be 
of this type and therefore is not necessarily expressible in the GMBL. 


2.2 Challenge: Globally Coupled Constraints 


A naive approach to generate models would incrementally instantiate objects 
via their immediate constraints. For (IMO 2009 P2), this would work as follows: 


1. Sample points A, B, and C. 


* https://github.com/rkruegs123/geo-model-builder 

[Figure 2 panels: problem statement written in DSL; static computation graph; numerical model(s); diagram(s). The example problem statement shown in the first panel is:] 

(param (A B C) triangle) 
(define D point (midp A B)) 
(assert (< (dist A B) 1.0)) 

Fig. 2: An overview of our method. Our program takes as input a GMBL program 
and translates it to a set of real-valued parameters and differentiable losses in the 
form of a static computation graph. We then apply gradient-based optimization 
to obtain numerical models and display them as diagrams. 


2. Compute O as the circumcenter of △ABC. 
3. Sample P and Q on the segments CA and AB, respectively. 
4. Compute K, L, and M as the midpoints of BP, CQ, and PQ, respectively. 
5. Compute Γ as the circle defined by K, L, and M. 


Immediately we see a problem: there is no guarantee that PQ is tangent to Γ 
in the final model. Indeed, the constraints of (IMO 2009 P2) are quite globally 
coupled: the choice of P partially determines the circle Γ to which PQ must 
be tangent, and not every choice of △ABC even admits a pair P and Q 
satisfying this constraint. This is an example of the frequent non-constructive 
nature of IMO geometry problems. When there is no obvious reparameterization 
to avoid downstream effects, all constraints must be considered simultaneously 
rather than incrementally or as a set of smaller local optimization problems. 
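
To make the failure of the naive approach concrete, the following Python sketch carries out steps 1 to 5 and measures how far PQ is from being tangent to Γ. It is an illustration only and is not taken from the GMB; the helper names and the choice of residual (distance from the center of Γ to the line PQ, minus the radius) are our own assumptions. Since M lies on both PQ and Γ, this residual is never positive, and it is zero exactly when PQ is tangent to Γ.

```python
import math
import random


def lerp(p, q, t):
    """Point at fraction t of the way from p to q."""
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))


def circumcenter(a, b, c):
    """Center of the circle through three non-collinear points."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy)


def dist_point_line(x, p, q):
    """Distance from point x to the line through p and q."""
    cross = (q[0] - p[0]) * (x[1] - p[1]) - (q[1] - p[1]) * (x[0] - p[0])
    return abs(cross) / math.dist(p, q)


def tangency_residual(a, b, c, t, s):
    """Run steps 2-5 for triangle abc and return dist(center, PQ) - radius."""
    o = circumcenter(a, b, c)          # step 2 (O is not needed for the residual)
    p, q = lerp(c, a, t), lerp(a, b, s)  # step 3: P on CA, Q on AB
    k, l, m = lerp(b, p, 0.5), lerp(c, q, 0.5), lerp(p, q, 0.5)  # step 4
    center = circumcenter(k, l, m)     # step 5: circle through K, L, M
    return dist_point_line(center, p, q) - math.dist(center, k)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        # Step 1: sample A, B, C, then sample where P and Q sit on their sides.
        a, b, c = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(3)]
        print(tangency_residual(a, b, c, rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)))
```

For a generic sample the residual is strictly negative, so the tangency condition has to be imposed as a constraint on all sampled quantities at once.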


2.3 Challenge: Root Resolution 


Even in the constructive case, local optimization is not necessarily sufficient given 
that multiple solutions can exist for algebraic constraints. More specifically, two 
circles or a circle and a line intersect at up to two distinct points and in a 
problem that specifies each distinct intersection point, the correct root to assign 
is generally not locally deducible. Without global information, this can lead to 
poor initializations becoming trapped in local minima. The GMBL accounts for 
this by including a set of explicit root selectors as described in Section 3.3. These 
root selectors provide global information for selecting the appropriate point from 
a set of multiple solutions to a system of equations. 
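
A minimal Python sketch of this idea follows: it computes both roots of a line-circle intersection and then applies a selector that returns the root different from a known point, mirroring the role of rs-neq. The function names and signatures are hypothetical and only imitate the GMBL names; they are not the GMB's API.

```python
import math


def inter_lc(p0, p1, center, radius):
    """Intersections of the line through p0 and p1 with a circle (0 to 2 points)."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    fx, fy = p0[0] - center[0], p0[1] - center[1]
    # Solve |p0 + t*(p1 - p0) - center|^2 = radius^2 for t.
    a = dx * dx + dy * dy
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []
    sq = math.sqrt(disc)
    ts = sorted({(-b - sq) / (2.0 * a), (-b + sq) / (2.0 * a)})
    return [(p0[0] + t * dx, p0[1] + t * dy) for t in ts]


def rs_neq(roots, other, tol=1e-9):
    """Select the root that is not (numerically) equal to `other`."""
    candidates = [r for r in roots if math.dist(r, other) > tol]
    if not candidates:
        raise ValueError("no root differs from the given point")
    return candidates[0]


if __name__ == "__main__":
    a_pt = (-1.0, 0.0)                       # a point already known to lie on the circle
    roots = inter_lc(a_pt, (1.0, 0.0), (0.0, 0.0), 1.0)
    print(rs_neq(roots, a_pt))               # the second intersection point
```

Without the selector, a solver would have to guess which of the two roots the problem statement intends.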


3 Methods 


In this section we present the GMBL and GMB in detail. In our presentation, 
we make use of the following notation and definitions: 


— The type of a geometric object can be one of (1) point, (2) line, or (3) 
circle. We denote the type of a real-valued number as number. 

— We use <> to denote an instance of a type. 

— A name is a string value that refers to a geometric object. 

3.1 GMBL: Overview 


The GMBL is a DSL for expressing olympiad-level geometry problems that loss- 
lessly induces a numerical optimization problem. It consists of four commands, 
each of which has a direct interpretation regarding the accumulation of (1) real- 
valued parameters and (2) differentiable losses in terms of these parameters: 


1. param: assigns a name to a new geometric object parameterized either by a 
default or optionally supplied parameterization 

2. define: assigns a name to an object computed in terms of existing ones 

3. assert: imposes an additional constraint (i.e. differentiable loss value) 

4. eval: evaluates a given constraint in the final model(s) 


Table 1 provides a summary of their usage. The GMBL includes an extensible 
library of functions and predicates with which commands are written. Notably, 
this library includes a notion of root selection to explicitly resolve the selection 
of roots to systems of equations with multiple solutions. 


3.2 GMBL: Commands 


In the following, we describe in more detail the usage of each command and their 
roles in constructing a tractable numerical optimization problem. 

param accepts as arguments a string, a type, and an optional parameter- 
ization. This introduces a geometric object that is parameterized either by the 
default parameterization for <type> or by the supplied method. Each primitive 
geometric type has the following default parameterization: 


— point: parameterized by its x- and y-coordinates 
— line: parameterized by two points that define the line 
— circle: parameterized by its origin and radius 


Optional parameterizations embody our method’s use of reparameterization to 
decrease the number of parameters and increase model diversity. For example, 
consider a point C on the line AB that is subject to additional constraints. Rather 
than optimizing over the x- and y-coordinates of C, we can express C in terms 
of a single value z that scales C’s placement on the line AB. 
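
The reparameterization in this example amounts to the following one-parameter map, shown here as a Python sketch (the function name is ours). Whatever value the optimizer assigns to z, the resulting point lies on the line AB, so no collinearity loss is needed.

```python
def point_on_line(a, b, z):
    """Point C = A + z * (B - A); z is the only free parameter."""
    return (a[0] + z * (b[0] - a[0]), a[1] + z * (b[1] - a[1]))


if __name__ == "__main__":
    print(point_on_line((0.0, 0.0), (2.0, 4.0), 0.5))
```

This replaces two parameters (the coordinates of C) and one constraint with a single unconstrained parameter.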

In addition to the standard usage of param outlined above, the GMBL in- 
cludes an important variant of this command to introduce sets of points that 
form triangles and polygons. This variant accepts as arguments (1) a list of point 
names, and (2) a required parameterization (see Table 1). This joint parameter- 
ization of triangles and polygons further prevents degeneracy. For example, to 
initialize a triangle △ABC, we can sample the vertices from normal distributions 
with means at distinct thirds of the unit circle. This method minimizes the 
sampling of triangles with extreme angle values and allows for explicit control 
over the distribution of acute vs. obtuse triangles by adjusting the standard 
deviations. Appendix C includes a list of all available parameterizations.⁵ 
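
As a rough illustration of this joint initialization, the Python sketch below draws each vertex from a normal distribution centered on one of three equally spaced points of the unit circle. The particular angles (0, 2π/3, 4π/3), the shared standard deviation sigma, and the function name are assumptions made for the example and may differ from the GMB's actual parameterization.

```python
import math
import random


def sample_triangle(rng, sigma=0.1):
    """Sample three vertices near equally spaced points of the unit circle."""
    vertices = []
    for k in range(3):
        theta = 2.0 * math.pi * k / 3.0      # mean direction for vertex k
        x = rng.gauss(math.cos(theta), sigma)
        y = rng.gauss(math.sin(theta), sigma)
        vertices.append((x, y))
    return vertices


if __name__ == "__main__":
    print(sample_triangle(random.Random(0)))
```

With sigma = 0 the sample is exactly equilateral; larger values of sigma spread the vertices and make extreme angles more likely.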


5 All appendices can be found in the long version of this paper [15]. 


582 R. Krueger et al. 


Table 1: An overview of usage for the four commands. 


Command   Usage 

param     (param <string> <type> <optional-parameterization>) 
          or 
          (param (<string>, ..., <string>) <parameterization>) 
define    (define <string> <type> <value>) 
assert    (assert <predicate>) 
eval      (eval <predicate>) 


define accepts as arguments a string, a type, and a value that is one 
of <point>, <line>, or <circle>. This command serves as a basic assignment 
operator and is useful for caching commonly used values. The functions described 
in Section 3.3 are used to construct <value> from existing geometric objects. 

assert accepts a single predicate and imposes it as an additional constraint 
on the system. This is achieved by translating the predicate to a set of algebraic 
values and registering them as losses. This command does not introduce any new 
geometric objects and can only refer to those already introduced by param or 
define. Notably, dimensional constraints and negations are always enforced via 
assert. Detail on supported predicates is presented in Section 3.3. 

eval, like assert, accepts a single predicate and therefore does not in- 
troduce any new geometric objects. However, unlike assert, the corresponding 
algebraic values are evaluated and returned with the final model rather than 
registered as losses and enforced via optimization. This command is most useful 
for those interested in integrating the GMBL with theorem provers. 


3.3 GMBL: Functions and Predicates 


The second component of our DSL is a set of functions and predicates for con- 
structing arguments to the commands outlined above. Functions construct new 
geometric objects and numerical values whereas predicates describe relationships 
between them. Our DSL includes high-level abstractions for common geometric 
concepts in olympiad geometry (e.g. excircle, isotomic conjugate). 

Functions in the GMBL employ a notion of root selectors to address the 
“multiple solutions problem” described in Section 2.3. In plane geometry, this 
problem typically manifests with multiple candidate point solutions, such as the 
intersection between a line and a circle. Root selectors control for this by allowing 
users to specify the appropriate point for functions with multiple solutions. 
Figure 3 demonstrates their usage in the functions inter-lc (intersection of a 
line and circle) and inter-cc (intersection of two circles). 

Importantly, arguments to predicates and functions can be specified with 
functions rather than named geometric objects. For a list of supported functions, 
predicates, and root selectors, refer to Appendices A, B, and C, respectively. 

(param Gamma circle) 
(param l line) 

;; Intersection points of l and Gamma 
(define A point (inter-lc l Gamma rs-arbitrary)) 
(define B point (inter-lc l Gamma (rs-neq A))) 

;; Intersection points of l and Omega 
(param Omega circle) 
(define C point (inter-lc l Omega (rs-closer-to-p A))) 
(define D point (inter-lc l Omega (rs-neq C))) 

;; Intersection points of Gamma and Omega 
(define E point (inter-cc Gamma Omega (rs-closer-to-l l))) 
(define F point (inter-cc Gamma Omega (rs-neq E))) 

(a) A GMBL program that uses root selectors. (b) A corresponding diagram. 


Fig. 3: An example usage of root selectors to resolve the intersections of lines 
and circles, and circles and circles. 


3.4 Auxiliary Losses 


The optimization problem encoded by a GMBL program includes three additional 
loss values. Foremost, for every instance of a circle intersecting a line or other 
circle, we impose a loss value that ensures the two geometric objects indeed 
intersect. The final two losses, albeit opposing, are intended to minimize global 
degeneracy. We impose one loss that minimizes the mean of all point norms 
to prevent exceptionally separate objects and a second to enforce a sufficient 
distance between points to maintain distinctness. 
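
The three auxiliary losses could take forms such as those in the Python sketch below. The concrete formulas (squared hinge penalties, the threshold min_dist, and the line-circle version of the intersection loss) are our assumptions for illustration; the paper does not spell out the exact expressions used by the GMB.

```python
import math
from itertools import combinations


def intersection_loss(center, radius, p0, p1):
    """Zero when the line through p0 and p1 meets the circle, positive otherwise."""
    cross = (p1[0] - p0[0]) * (center[1] - p0[1]) - (p1[1] - p0[1]) * (center[0] - p0[0])
    gap = abs(cross) / math.dist(p0, p1) - radius
    return max(0.0, gap) ** 2


def mean_norm_loss(points):
    """Mean distance of the points from the origin (keeps objects close together)."""
    return sum(math.hypot(x, y) for x, y in points) / len(points)


def distinctness_loss(points, min_dist=0.1):
    """Penalize every pair of points closer than min_dist (keeps points distinct)."""
    return sum(max(0.0, min_dist - math.dist(p, q)) ** 2
               for p, q in combinations(points, 2))


if __name__ == "__main__":
    pts = [(3.0, 4.0), (0.0, 0.0)]
    print(mean_norm_loss(pts), distinctness_loss(pts))
```

The last two terms pull in opposite directions: one shrinks the configuration toward the origin while the other stops points from collapsing onto each other.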


3.5 Implementation 


We built the GMB, an open-source implementation that compiles GMBL pro- 
grams to optimization problems and generates models. The GMB takes as input 
a GMBL program and processes each command in sequence to accumulate real- 
valued parameters and differentiable losses in a Tensorflow computation graph. 
After registering auxiliary losses, we apply off-the-shelf gradient-based local 
optimization to produce models of the constraint system. In summary, to generate 
N numerical models, our optimization procedure works as follows: 


1. Construct computation graph by sequentially processing commands. 
2. Register auxiliary losses. 
3. Sample sets of initial parameter values and rank via loss value. 
4. Choose (next) best initialization and optimize via gradient descent. 
5. Repeat (4) until obtaining N models or the maximum # of tries is reached. 


Our program accepts as arguments (1) the # of models desired (default = 1), 
(2) the # of initializations to sample (default = 10), and (3) the max # of 
optimization tries (default = 3). Our program also accepts the standard suite 
of parameters for training a Tensorflow model, including an initial learning rate 
(default = 0.1), a decay rate (default = 0.7), the max # of iterations (default = 
5000), and an epsilon value (default = 0.001) to determine stopping criteria. 
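
The sample, rank, and optimize loop can be sketched in plain Python as follows. This is not the GMB's implementation: we replace the Tensorflow computation graph with a hand-written loss and central finite-difference gradients, omit learning-rate decay, and use a toy constraint system (find a point C at distance 1 from both A = (0, 0) and B = (1, 0)). All function names are hypothetical; the default arguments mirror the defaults listed above.

```python
import math
import random

A, B = (0.0, 0.0), (1.0, 0.0)


def loss(params):
    """Toy constraint system: |CA| = 1 and |CB| = 1 for C = (x, y)."""
    c = (params[0], params[1])
    return (math.dist(c, A) - 1.0) ** 2 + (math.dist(c, B) - 1.0) ** 2


def grad(f, params, h=1e-6):
    """Central finite-difference gradient (stand-in for automatic differentiation)."""
    g = []
    for i in range(len(params)):
        up, dn = list(params), list(params)
        up[i] += h
        dn[i] -= h
        g.append((f(up) - f(dn)) / (2.0 * h))
    return g


def optimize(f, init, lr=0.1, max_iters=5000, eps=1e-3):
    """Gradient descent from one initialization; None if it does not reach eps."""
    params = list(init)
    for _ in range(max_iters):
        if f(params) < eps:
            return params
        params = [p - lr * gi for p, gi in zip(params, grad(f, params))]
    return params if f(params) < eps else None


def generate_models(f, n_models=1, n_inits=10, max_tries=3, seed=0):
    rng = random.Random(seed)
    # Sample initial parameter values and rank them by loss.
    inits = sorted(([rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_inits)), key=f)
    models = []
    # Try the best initializations in order until enough models are found.
    for init in inits[:max_tries]:
        result = optimize(f, init)
        if result is not None:
            models.append(result)
        if len(models) == n_models:
            break
    return models


if __name__ == "__main__":
    print(generate_models(loss))
```

Ranking the sampled initializations before optimizing means the limited number of tries is spent on the starting points that already violate the constraints least.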


Table 2: An evaluation of our method’s ability to generate a single model for 
each of the 36 IMO problems encoded in our DSL. For each problem, 10 sets of 
initial parameters were sampled over which our program optimized up to three. 
All data shown are the average of three trials. The first row demonstrates results 
using default parameters ($\varepsilon$ = 0.001, learning rate = 0.1, # iterations = 5,000). 

                                                               Time per Problem (s) 
$\varepsilon$   Learning Rate   Iterations   % Success   % Goal Satisfaction   All     Fail     Success 
0.001           0.1             5,000        93.52       85.84                 27.07   223.51   14.72 
0.01            0.1             5,000        92.60       84.71                 26.86   229.71   14.43 
0.001           0.01            5,000        88.88       86.32                 27.54   137.85   14.33 
0.001           0.1             10,000       92.59       86.02                 34.78   287.51   15.43 


4 Results 


In this section, we present an evaluation of our method’s proficiency in three 
areas of expressing and solving olympiad-level geometry problems: 


1. Expressing olympiad-level geometry problems as GMBL programs. 
2. Generating models for these programs. 
3. Preserving truths (up to tolerance) that are not directly optimized for. 


Table 2 contains a summary of our results. 

Our evaluation considers all 39 IMO geometry problems since 2000. Of these 
39 problems, 36 can be expressed in our DSL. Those that we cannot encode 
involve variable numbers of geometric objects. For 32 of these 36 problems, we 
can express the goals as eval commands in the corresponding GMBL programs. 
The goals of the additional four problems are not expressible in our DSL, e.g. 
our DSL cannot express goals of the form “Find all possible values of $\angle ABC$.” 

To evaluate (2) and (3), we conducted three trials in which we ran our pro- 
gram on each of the 36 encodings with varying sets of arguments. With default 
arguments, our program generated a single model for (on average) 94% of these 
problems. Our program ran for an average of 27.07 seconds for each problem but 
there is a stark difference between time to success and time to failure (14.72 vs 
223.51 seconds) as failure entails completing all optimization attempts whereas 
successful generation of a model terminates the program. We achieve similar 
success rates with more forgiving training arguments or a higher tolerance. 

For use in automated theorem proving, it is essential that models generated 
by our tool not only satisfy the constraint problem up to tolerance but also any 
other truths that follow from the set of input constraints. The most immediate 
example of such a truth is the goal of a problem statement. Therefore, we used 
the goals of IMO geometry problems as a proxy for this ability by only checking 
the satisfaction of the goal in the final model (i.e. with an eval statement) 
rather than directly optimizing for it. In our experiments, we considered such a 
goal satisfied if it held up to $\varepsilon \times 10$, as it is reasonable to expect slightly higher 
floating-point error without explicit optimization. Using default parameters, the 
goal held up to tolerance in 86% of problems for which we found a model and 
could express the goal. This rate was similar across all other sets of arguments. 


5 Future Work 


Here we discuss various opportunities for improvement of our method. 

Firstly, improvements could be made to our method of numerical optimiza- 
tion. While Tensorflow offers a convenient way of caching terms via a static 
computation graph and optimizing directly over this representation, there is no 
explicit support for constrained optimization. Because of this, arbitrary weights 
have to be assigned to each loss value. Though rare, this can result in false 
positives and negatives for the satisfaction of a constraint. Using an explicit 
constrained-optimization method (e.g. SLSQP) would enable the separation of 
soft constraints (e.g. maximizing the distance between points) and hard con- 
straints (e.g. those enforced by assert), removing the need for arbitrary weights. 

Secondly, cognitive overhead could be reduced as users are currently required 
to determine degrees of freedom; it would be far easier to write problem state- 
ments using only declarations of geometric objects and constraints between them, 
e.g. using only assert. This could be accomplished by treating our DSL as a 
low-level “instruction set” to which a higher-level language could be compiled. 
The main challenge of such a compiler would be appropriately identifying op- 
portunities to reduce the degrees of freedom. To achieve this, the compiler would 
require a decision procedure for line and circle membership. 

Lastly, we could improve our current treatment of distinctness. To prevent 
degenerate solutions, our method optimizes for object distinctness and rejects 
models with duplicates. However, there is the occasional problem for which a 
local optimum encodes two provably distinct points as equal up to floating point 
tolerance. There are many techniques that could be applied to this problem (e.g. 
annealing) though we do not consider them here as the issue is rare. 


6 Related Work 


Though many techniques for mechanized geometry diagram construction have 
been introduced over the decades, no method, to the best of our knowledge, can 
produce models for more than a negligible fraction of olympiad problems. There 
exist many systems, built primarily for educational purposes, for interactively 
generating diagrams using ruler-and-compass constructions, e.g. GCLC [13], Ge- 
oGebra [11], Geometer’s Sketchpad [20], and Cinderella [19]. There are also non- 
interactive methods for deriving such constructions, e.g. GeoView [2] and pro- 
gram synthesis [9,12]. However, as discussed in Section 2.2, very few olympiad 
problems can be described in such a form. Alternatively, Penrose is an early- 
stage system for translating mathematical descriptions to diagrams that relies 
on constrained numerical optimization and therefore does not suffer from this 
expressivity limitation [25]. However, this system lacks support for constraints 
with multiple roots, e.g. intersecting circles. There are more classical methods 
that similarly depart from constructive geometry. MMP/Geometer [8] translates 
the problem to a set of algebraic equations and uses numerical optimization 
(e.g. BFGS) and GEOTHER [22, 23] first translates a predicate specification 
into polynomial equations, decomposes this system into representative triangular 
sets, and obtains solutions for each set numerically. Neither of these programs 
is available to evaluate, though we did test similar approaches using modern 
libraries (specifically: sympy [17] and scipy [21]); both numerical and symbolic 
methods would almost always time out on relatively simple olympiad problems. 

Generating models for systems of geometric constraints is also a challenge 
in computer-aided design (CAD) for engineering diagram drawing. Recent ef- 
forts focus on graph-based synthetic methods, a subset of techniques concerned 
with ruler-and-compass constructions [3,5,6,10,14,16,18]. Most relevant to our 
method are Bettig and Shah’s “solution selectors” which, similar to root selec- 
tors in the GMBL, allow users to specify the configuration of a CAD model [4]. 
However, these solution selectors are purpose-built and do not generalize. 


7 Conclusion 


It is standard in GTP to rely on diagrams for heuristic reasoning but the scale 
of automatic diagram construction is limited. To enable efforts to build a solver 
for IMO geometry problems, we developed a method for building diagrams for 
olympiad-level geometry problems. Our method is based on the GMBL, a DSL 
for expressing geometry problems that induces (usually) tractable numerical op- 
timization problems. The GMBL includes a set of commands that have a direct 
interpretation for accumulating real-valued parameters and differentiable losses. 
Arguments to these commands are constructed with a library of functions and 
predicates that includes notions of root selection, joint distributions, and repa- 
rameterizations to minimize degeneracy and the number of parameters. We im- 
plemented our approach in an open-source tool that translates GMBL programs 
to diagrams. Using this program, we evaluated our method on all IMO geometry 
problems since 2000. Our implementation reliably produces models; moreover, 
known truths that are not directly optimized for typically hold up to tolerance. 
By handling configurations of this complexity, our system clears a roadblock in 
GTP and provides a critical tool for undertakers of the IMO Grand Challenge. 


Abstract. Fusemate is a logic programming system that implements the possible 
model semantics for disjunctive logic programs. Its input language is centered 
around a weak notion of stratification with comprehension and aggregation oper- 
ators on top of it. Fusemate is implemented as a shallow embedding in the Scala 
programming language. This enables using Scala data types natively as terms, a 
tight interface with external systems, and it makes model computation available 
as an ordinary container data structure constructor. The paper describes the above 
features and implementation aspects. It also demonstrates them with a non-trivial 
use-case, the embedding of the description logic ALCIF into Fusemate’s input 
language. 


1 Introduction 


Fusemate¹ is a logic programming system for computing possible models of disjunctive 
logic programs [23,24]. A Fusemate logic program consists of (typically) non-ground 
if-then rules with stratified default negation in the body [21]. Stratification entails that a 
true default-negated body literal remains true in the course of deriving new conclusions. 

Fusemate was introduced in [7] for modelling systems that evolve over time and 
for analysing their current state based on the events so far. Such tasks are often subsumed 
under the terms of stream processing, complex event recognition, and situational 
awareness, and have been addressed (also) with logic-based approaches [2,9,4,5]. 

To my knowledge, Fusemate is unique among all these and other logic programming 
systems (and theorem provers) in the way it is implemented. Fusemate is 
implemented by shallow embedding in a full-fledged programming language, Scala [25]. 
Essentially, the user writes a syntactically sugared Scala program utilizing familiar 
logic programming notation, and the program’s execution returns models. This has 
advantages and disadvantages. The main disadvantage is that it is more difficult to 
implement performance boosting measures like term indexing. The main advantage is 
that interfacing with data structure libraries and with external systems is easy, an aspect 
whose importance has been emphasized for virtually all of the above systems. In fact, 
Fusemate is motivated in parts by exploring how far the embedding approach can be 
pushed and to what benefit. 

The earlier Fusemate paper [7] focused on the model computation calculus with a 
belief revision operator as the main novelty. It utilized a certain notion of stratification 


1 Fusemate is available at https://bitbucket.csiro.au/users/bau050/repos/fusemate/ 
by time (SBT) for making the calculus effective and useful in the intended application 
areas. This system description focuses on the advantages of the shallow embedding 
approach as boosted by new language features introduced here. These new language 
features are (a) non-standard comprehension and aggregation operators, among others, 
and (b) a weaker notion of stratification by time and predicates (SBTP). In brief, SBTP 
is a lexicographic combination of stratification by time and the standard stratification 
in terms of the call-graph of the program. Section 5 has an example that demonstrates 
the need for (a) and (b) in combination, and Section 4 discusses the shallow embedding 
approach and its advantages on a more general level. 

Here is an excerpt from a Fusemate program that previews some of the new features: 


 1  type Time = java.time.LocalDateTime 
 2  val allIds = 1 to 10 
 3  case class Change(time:Time, id:Int, color:String) extends Atom 
 4  case class State(time:Time, id:Int, color:String) extends Atom 
 5  case class FullState(time:Time, drive:Set[Int], stop:Set[Int]) extends Atom 
 6  State(time, id, color) :- Now(time), CHOOSE(id:Int, allIds), Change(t <= time, id, color) 
 7  FullState(time, drive.toSet, stop.toSet) :- State(time,_,_), 
 8    COLLECT(drive:List[Int], id STH State(time, id, "green")), 
 9    COLLECT(stop:List[Int], id STH (State(time, id, color), color=="red" || color=="yellow")) 
10  MovingState(time) :- FullState(time, drive, stop), stop.size < drive.size 
11  Faulty(time, id, since) :- State(time, id, "red"), Change(since < time, id, "green"), 
12    NOT (Change(t, id, "yellow"), since < t, t < time) 


The scenario comprises traffic lights identified by numbers 1 to 10 (line 2). In the 
course of time the traffic lights change their colors, and each such event is recorded as 
a corresponding Change atom (line 3). The rule on line 6 computes a State at a current 
time Now(time) as a snapshot of the current colors of all traffic lights. For that, the 
comprehension Change(t <=time,id,color) on line 6 finds the latest Change event before or 
at time for a fixed id chosen from allIds, and binds that time to the (unused) variable t. A 
FullState aggregates the separate State facts at a time partitioned as (Scala) sets of ids of 
“drive” and “stop” colors. In that, the COLLECT special form collects in a Scala List-typed 
variable the specified terms that satisfy the body behind STH. Notice that all atoms in 
FullState refer to the same time, yet the program is SBTP because State comes before 
FullState in predicate stratification. (Predicate stratification is computed automatically 
by Fusemate with Tarjan’s algorithm.) The rule on line 10 demonstrates the use of the 
Scala Set method size in the body. Line 11 demonstrates the use of default negation in 
combination with comprehension. When applied to a given sequence of Change events, 
Fusemate computes models, one at a time, each as a Scala set of atoms. 


2 Fusemate Programs 


For the purpose of this paper, a brief summary of the syntactic notions underlying 
Fusemate programs is sufficient; see [7] for details. Terms and atoms of a given signature 
are defined as usual. Let $\mathit{var}(z)$ denote the set of variables occurring in an expression 
$z$. We say that $z$ is ground if $\mathit{var}(z) = \emptyset$. We write $z\sigma$ for applying a substitution $\sigma$ to 


The Fusemate Logic Programming System 591 


$z$. The domain of $\sigma$ is denoted by $\mathit{dom}(\sigma)$. A substitution $\gamma$ is a grounding substitution 
for $z$ iff $\mathit{dom}(\gamma) = \mathit{var}(z)$ and $z\gamma$ is ground. In this case we simply say that $\gamma$ is for $z$. 

Let T be a countably infinite discrete set of time points equipped with a total strict 
ordering < (“earlier than”), e.g., the integers. Assume that the time points, comparison 
operators = and <, and a successor time function +1 are part of the signature and 
interpreted in the intended way. A time term is a (possibly non-ground) term over the 
sub-signature $T \cup \{+1\}$. 

The signature may contain other “built-in” predicate and function symbols for pre- 
defined types such as strings, arithmetic data types, sets, etc. We only informally assume 
that all terms are built in a well-sorted way and that built-in operators over ground terms 
can be evaluated effectively. 

An ordinary atom (with time term $t$) is of the form $p(t, t_1, \ldots, t_n)$ where $p$ is an 
ordinary predicate (i.e., neither a time predicate nor built-in), $t$ is a time term and 
$t_1, \ldots, t_n$ are terms. A (Fusemate) rule is an implication written in Prolog-like syntax as 


$$H \mathrel{\text{:-}} b_1, \ldots, b_k, \mathrm{not}\ \vec{b}_{k+1}, \ldots, \mathrm{not}\ \vec{b}_n\ . \qquad (1)$$ 


In (1), a rule head $H$ is either (a) a disjunction $h_1 \lor \cdots \lor h_m$ of ordinary atoms, for 
some $m \ge 1$, or (b) the expression fail.² In case (a) the rule is ordinary and in case 
(b) it is a fail rule. A rule body $B$, the part to the right of :-, is defined by mutual 
recursion as follows. A positive body literal is one of the following: (a) an ordinary 
atom, (b) a comprehension atom (with time term $x$) of the form $p(x \circ t, t_1, \ldots, t_n)$ sth $B$, 
where $x$ is a variable, $\circ \in \{<, \le, >, \ge\}$ and $B$ is a body, (c) a built-in call, i.e., an 
atom with a built-in predicate symbol, or (d) a special form let$(x, t)$, choose$(x, ts)$, 
match$(t, s)$ or collect$(x, t$ sth $B)$ where $x$ is a variable, $s$, $t$ are terms, $ts$ is a list of terms, 
and $B$ is a body. A positive body is a list $\vec{b} = b_1, \ldots, b_k$ of positive body literals with 
$k \ge 0$. If $k = 0$ then $\vec{b}$ is empty, otherwise it is non-empty. A negative body literal is 
an expression of the form not $\vec{b}$, where $\vec{b}$ is a non-empty positive body. A body is a list 
$B = b_1, \ldots, b_k, \mathrm{not}\ \vec{b}_{k+1}, \ldots, \mathrm{not}\ \vec{b}_n$ comprised of a (possibly empty) positive body and 
(possibly zero) negative body literals. It is variable free if $\mathit{var}(b_1, \ldots, b_k) = \emptyset$. 

Let $r$ be a rule (1). We say that $r$ is range-restricted iff $\mathit{var}(H) \subseteq \mathit{var}(\vec{b})$. Compared 
to the usual notion of range-restrictedness [18], Fusemate rules may contain extra 
variables in negative body literals. For example, $p(t, x)$ :- $q(t, x), \mathrm{not}(s < t, r(s, x, y))$ 
is range-restricted in our sense with extra variables $s$ and $y$. The extra variables are 
implicitly existentially quantified within the not expression. The example corresponds to 
the formula $q(t, x) \land \neg\exists s, y.\,(s < t \land r(s, x, y)) \to p(t, x)$. Semantically and operationally 
this will cause no problems thanks to stratification, introduced next. 

Fusemate programs — sets of rules — need to be “stratified by time and by predicates” 
(SBTP). The standard notion of stratification by predicates means that the call graph 
of the program contains no cycles going through negative body literals. The edges of 
this call graph are the “depends on” relation between predicate symbols such that p 
positively (negatively) depends on q if there is a rule with a p-atom in its head and 
a g-atom in its positive (negative) body. For disjunctive heads, all head predicates are 


2 This definition of head is actually simplified as Fusemate offers an additional head operator for 
belief revision, see [7]. This is ignored here. 
defined to depend positively on each other. Every strongly connected component of the 
call graph is called a stratum, and in predicate stratified programs negative body literals 
can occur only in strata lower than the head stratum. 
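
Predicate stratification as just described can be sketched in Python: compute the strongly connected components of the "depends on" graph with Tarjan's algorithm and reject any negative dependency that stays inside one component. This is an illustrative sketch, not Fusemate's Scala implementation; the function names and the dictionary encoding of the call graph are assumptions.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to an iterable of successors."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(component)

    for v in list(graph):
        if v not in index:
            visit(v)
    return sccs  # dependencies are emitted before the predicates that use them

def stratify(pos_deps, neg_deps):
    """Map each predicate to a stratum number; raise if negation occurs inside a cycle."""
    graph = {}
    for deps in (pos_deps, neg_deps):
        for p, qs in deps.items():
            graph.setdefault(p, set()).update(qs)
    sccs = strongly_connected_components(graph)
    stratum = {p: i for i, comp in enumerate(sccs) for p in comp}
    for p, qs in neg_deps.items():
        for q in qs:
            if stratum[p] == stratum[q]:
                raise ValueError(f"cycle through negation: {p} -> not {q}")
    return stratum

# Dependencies of the traffic light rules from the introduction.
pos = {"State": {"Now", "Change"}, "FullState": {"State"}, "Faulty": {"State", "Change"}}
neg = {"Faulty": {"Change"}}
strata = stratify(pos, neg)
```

Because edges point from a head predicate to the predicates it depends on, a negative dependency that leaves its component always lands in a lower-numbered stratum.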

SBTP is defined as follows: for every rule in a given program, (a) there is a 
variable time that is the time term of some ordinary $b \in \vec{b}$, (b) if $H$ is an ordinary head 
then every head literal must have a time term constrained to be $\ge$ than time, and (c) for 
all rule bodies $B$ occurring in the rule: 


(i) the time term of every ordinary or comprehension body literal in $B$ must be constrained 
to be $\le$ than time, and 

(ii) for every negative body literal not $\vec{b}$ in $B$ (including the top-level body of the rule) and 
every ordinary or comprehension literal $b \in \vec{b}$, the time term of $b$ must be constrained 
to be (i) $<$ than time or (ii) $\le$ than time and the predicate symbol of $b$ is in a lower 
stratum than $H$. 


For the purpose of this paper we only informally assume that all rules contain constraints 
for enforcing the required time ordering properties. There are similar stratification 
requirements for comprehension atoms and special forms so that their evaluation satisfies 
the counterpart of condition (ii) (see below for collect). A fully formal definition could 
be given by modifying the spelled-out definition of SBT in [7]. 

As an example, if r belongs to a lower stratum than p then the following five rules 
all are SBTP, while only the first two rules are SBT. 


p(time, x) :- q(time,x),r(t, y),t < time (2) 
p(time, x) :- q(time, x), not(r(t, y),t < time) (3) 
p(time, x) :- q(time, x), not(r(t, y),t $\le$ time) (4) 
p(time + 1,x) :- q(time, x), not(r(t, y),t < time) (5) 
p(time, x) :- q(time, x), (p(t < time, y) sth q(t, y)), r(t, y) (6) 


Finally, a (Fusemate) program is a set of range-restricted rules that is SBTP. 


3 Model Computation 


The possible model semantics of disjunctive logic programs associates to a given 
disjunctive program a certain set of normal programs (i.e., without disjunctive heads) 
and takes the intended model(s) of these normal programs as the possible models of 
the given program. These “split” programs represent all possible ways of making one 
or more head literals true, for every disjunctive rule. As a propositional example, the 
program {a :- b, a V c :- b,b :- } is associated to the split programs {a :- b, b :- } and 
{a :- b, c :- b,b :- }. The possible models, hence, are {a, b} and {a, b, c}. 
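
The propositional example can be reproduced with a short Python sketch. It enumerates the split programs by choosing a non-empty subset of head atoms for every rule and takes the least model of each split. The encoding of rules as (heads, body) pairs and the function names are assumptions, and the sketch only covers negation-free propositional programs.

```python
from itertools import combinations, product

def least_model(rules):
    """Least model of a definite propositional program given as (head, body) pairs."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model

def nonempty_subsets(atoms):
    atoms = sorted(atoms)
    return [set(c) for r in range(1, len(atoms) + 1) for c in combinations(atoms, r)]

def possible_models(program):
    """program: list of (heads, body); heads is a set of atoms read as a disjunction."""
    choices = [nonempty_subsets(heads) for heads, _ in program]
    models = set()
    for selection in product(*choices):
        split = [(h, body) for chosen, (_, body) in zip(selection, program) for h in chosen]
        models.add(frozenset(least_model(split)))
    return models

# The program {a :- b, a v c :- b, b :- } from the text.
program = [({"a"}, ["b"]), ({"a", "c"}, ["b"]), ({"b"}, [])]
result = possible_models(program)
```

Here `result` contains exactly the two models named in the text, {a, b} and {a, b, c}.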

Fusemate computes possible models by bottom-up fixpoint computation and dynamically 
grounding the program rules in the style of hyper tableaux [8]. The model 
computation procedure is implemented as a variant of the well-known given-clause 
algorithm, which seeks to avoid deriving the same conclusion from the same premises 
twice. It exhausts inferences in an outer loop/inner loop fashion according to the given 
program’s stratification by time and by predicates. The main data structure is a set of 
paths, where each path represents a partial model candidate computed so far (see [7] for 
more details). Paths are selected, extended, split and put back into the set until exhausted, 
for a depth-first, left-to-right inference strategy. Paths carry full status information, which 
is instrumental for implementing incrementality, such that facts with current or later time 
can be added at any stage without requiring model recomputation from scratch. This, 
however, necessitated keeping already exhausted paths for continued inferences later. 


The proof procedure’s core operation is computing a body matcher, i.e., a substitution 
$\gamma$ for a rule’s positive body variables so that the rule body becomes satisfied in the current 
partial model candidate. Formally, let $I$ be a set of ordinary ground atoms, representing 
the obvious interpretation that assigns true to exactly the members of $I$. Let $B$ be a body. 
A body matcher for $B$ is a substitution $\gamma$ for the positive body of $B$, written as $I, \gamma \models B$, 
such that the following holds ($b, B$ means the sequence of head $b$ and rest body $B$): 


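$I, \varepsilon \models \epsilon$ ($\epsilon$ is the empty body and $\varepsilon$ is the empty substitution) 
$I, \gamma\sigma \models b, B$ iff $\gamma$ is for $b$, $b\gamma \in I$ and $I, \sigma \models B\gamma$, with $b$ ordinary atom 
$I, \gamma\sigma \models (p(x \circ t, t_1, \ldots, t_n)\ \mathrm{sth}\ C), B$ iff $\gamma$ is for $p(x, t_1, \ldots, t_n)$ and 
    (1) $p(x, t_1, \ldots, t_n)\gamma \in I$, $x\gamma \circ t$ and $I, \delta \models C\gamma$ for some $\delta$, 
    (2) there is no $\gamma'$ for $p(x, t_1, \ldots, t_n)$ and no $\delta$ such that 
        $p(x, t_1, \ldots, t_n)\gamma' \in I$, $x\gamma < x\gamma' \circ t$ and $I, \delta \models C\gamma'$, and 
    (3) $I, \sigma \models B\gamma$ 
$I, \sigma \models a, B$ iff $a$ evaluates to true and $I, \sigma \models B$, where $a$ is ground built-in 
$I, \gamma\sigma \models \mathrm{let}(x, t), B$ iff $\gamma = [x \mapsto t]$ and $I, \sigma \models B\gamma$ 
$I, \gamma\sigma \models \mathrm{choose}(x, ts), B$ iff $\gamma = [x \mapsto t]$ and $I, \sigma \models B\gamma$ for some $t \in ts$ 
$I, \gamma\sigma \models \mathrm{match}(t, s), B$ iff $\gamma$ is for $t$, $t\gamma = s$ and $I, \sigma \models B\gamma$ 
$I, \gamma\sigma \models \mathrm{collect}(x, t\ \mathrm{sth}\ C), B$ iff $\gamma = [x \mapsto \{t\delta \mid I, \delta \models C\}]$ and $I, \sigma \models B\gamma$ 
$I, \sigma \models \mathrm{not}\ \vec{b}, B$ iff there is no $\delta$ such that $I, \delta \models \vec{b}$, and $I, \sigma \models B$ 


A comprehension atom $p(x \circ t, t_1, \ldots, t_n)$ sth $B$ stands for the subset of all ground 
$p$-instances in $I$ such that $B$ is satisfied and with a time $x$ as close as possible to $t$ wrt. 
$<$ or $\le$. The cases for $>$ and $\ge$ are dual and not spelled out above to save space. The 
collect special form collects in the variable $x$ the set of all instances of term $t$ such that 
the body $C$ is satisfied in $I$. We require comprehension atoms and collects to be used in a 
stratified way, so that their results do not change later in a derivation when $I$ is extended. 
The requirements are the same as with not and can be enforced by ordering constraints. 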
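
The intended reading of a comprehension atom with $\le$ and of collect can be illustrated with a small Python sketch over a set of ground atoms. The tuple encoding `(predicate, time, id, color)` and the helper names are assumptions made for this illustration; they are not part of Fusemate.

```python
def comprehension(interpretation, pred, time, keep=lambda atom: True):
    """All pred-atoms with timestamp <= time that satisfy keep and whose
    timestamp is as close as possible to time."""
    matches = [a for a in interpretation
               if a[0] == pred and a[1] <= time and keep(a)]
    if not matches:
        return []
    closest = max(a[1] for a in matches)
    return [a for a in matches if a[1] == closest]

def collect(interpretation, term, condition):
    """Set of all instances of term for atoms that satisfy condition."""
    return {term(a) for a in interpretation if condition(a)}

I = {
    ("Change", 1, 1, "red"),
    ("Change", 5, 1, "green"),
    ("Change", 9, 1, "yellow"),
    ("Change", 4, 2, "red"),
}

# Change(t <= 6, 1, color): the latest change of light 1 at or before time 6.
latest = comprehension(I, "Change", 6, keep=lambda a: a[2] == 1)

# collect(ids, id sth Change(_, id, "red")): ids of all lights that ever turned red.
red_ids = collect(I, lambda a: a[2], lambda a: a[0] == "Change" and a[3] == "red")
```

Both results depend only on atoms with timestamps up to the queried time, which is why stratification is needed to keep them stable when the interpretation is extended later.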


The definition above extends the earlier definition of body matchers in [7] with 
the new comprehension construct and the let, choose, match, collect operators. It 
now also enforces left-to-right evaluation of $B$ because the new binding operators 
depend on a fixed order guarantee to be useful. An example is the (nonsensical) body 
CHOOSE(x: Int, List(1, 2, 3)), LET(xxx: Int, 3*x), xxx % 2 == 0 which relies on this order. 
Undefined cases, e.g., when evaluation of a non-ground built-in is attempted, or when a 
binder variable has already been used before, are detected as compile time syntax errors. 


4 Shallow Embedding in Scala 


Fusemate is implemented as a shallow embedding into Scala [25]. It has three conceptual 
main components: a signature framework, a Scala compiler plugin, and an inference 
engine for fixpoint computation as explained in Section 3. The signature framework 
provides a set of Scala class definitions as the syntactical basis for writing Fusemate 
programs. It is parameterized in a type Time, which can be any Scala or Java type that is 
equipped with an ordering and an addition function for time increments, for example Int 
or java.time.OffsetDateTime. The programmer then refines an abstract class Atom of the 
Time-instantiated signature framework with definitions of predicate symbols and their 
(Scala-)sorted arities. See lines (3)-(5) in the program in the introduction for an example. 
These atoms then can be used in Fusemate rules, see lines (6)—(12) in the example. 

While written in convenient syntax, rules are syntactically ill-formed Scala. This 
problem is solved by the compiler plugin, which intercepts the compilation of the input 
file at an early stage and transforms the rules into valid Scala source code.³ More 
precisely, a rule is transformed into a curried partial function that is parameterized in 
an interpretation context I. The curried parameters are Scala guarded pattern matching 
expressions and correspond to the rule’s positive body literals, in order. For example, the 
Faulty rule on lines (11) and (12), with the condition since < time ignored for simplicity, 
is (roughly) translated into the function f: 


1  (I: Interpretation) => { case State(time, id, "red") => { 
2      case Change(since, id1, "green") if id == id1 && 
3        ({case Change(t, id2, "yellow") if id == id2 && since < t && t < time => FAIL} failsOn I) => 
4          Faulty(time, id, since) } } 


Notice the renaming of repeated occurrences of the id variable, which is needed for the 
correct semantics. Notice also that a Scala Boolean-valued expression in an ordinary 
body literal position (e.g., t< time) simply becomes a guard in a pattern. 

The code above can be understood with body matcher computation in mind. Suppose 
the inference engine selects an interpretation I from the current set of paths. For exhausting 
f on I, the inference engine combinatorially chooses literals $l_1, l_2 \in I$ and collects 
the evaluation results of $f(I)(l_1)(l_2)$, if defined. Observe that by the transformation 
into Scala pattern matching, body matchers are only implicitly computed by the Scala 
runtime system. Each evaluation result, hence, is a body-matcher instantiated head. 
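
The curried-function view of a rule can be mimicked in Python with nested closures. This is a loose analogue under stated assumptions, not the code the compiler plugin generates: atoms are encoded as tuples `(predicate, time, id, color)`, an undefined partial function is modelled by returning `None`, the names `faulty_rule` and `exhaust` are hypothetical, and the Change literal is matched as a plain atom guarded by since < time rather than as a comprehension.

```python
from itertools import product

def faulty_rule(I):
    def match_state(l1):
        if not (l1[0] == "State" and l1[3] == "red"):
            return None  # first pattern does not match
        _, time, id_, _ = l1

        def match_change(l2):
            if not (l2[0] == "Change" and l2[3] == "green" and l2[2] == id_):
                return None  # second pattern or its guard does not match
            since = l2[1]
            if not since < time:
                return None
            # NOT (Change(t, id, "yellow"), since < t, t < time), checked against I
            blocked = any(a[0] == "Change" and a[2] == id_ and a[3] == "yellow"
                          and since < a[1] < time for a in I)
            return None if blocked else ("Faulty", time, id_, since)

        return match_change
    return match_state

def exhaust(rule, I):
    """Try every pair of literals from I and collect the heads that are defined."""
    heads = set()
    for l1, l2 in product(I, repeat=2):
        step = rule(I)(l1)
        if step is not None:
            head = step(l2)
            if head is not None:
                heads.add(head)
    return heads

I1 = {("State", 10, 1, "red"), ("Change", 3, 1, "green"), ("Change", 7, 1, "red")}
I2 = I1 | {("Change", 5, 1, "yellow")}
faulty1 = exhaust(faulty_rule, I1)
faulty2 = exhaust(faulty_rule, I2)
```

As in the Scala translation, variable bindings are never represented as explicit substitutions; they live in the closures' local variables.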

The rule’s negative body literal is translated into the code on line (3) and conjoined 
to the guard of the preceding ordinary literal. In general, a negative literal NOT body is 
treated by translating FAIL :- NOT body and evaluating the resulting Scala code on I by 
means of the failsOn method. If FAIL is not derivable then NOT body is satisfied. Again, 
appropriate bindings for the variables bound outside of body are held implicitly by the 
Scala runtime system. The translation of the special forms and comprehension is not 
explained here for space reasons. Fusemate can show the generated code, though. 


3 Early experiments showed it is cumbersome and error-prone to write the Scala code by hand, so 
this was not an option. The compiler plugin is written in Scala and operates at the abstract syntax 
tree level. This could conveniently be done thanks to a sophisticated quasiquote mechanism. 


Properties and Advantages 


The shallow embedding approach enables introspection capabilities and interfacing 
between the rule language and the host language beyond what is implemented in other 
systems. In Fusemate, the terms of the logical language are nothing but Scala objects. 
As a consequence, any available Scala type or library data structure can be used as a 
built-in without extending an “interface” to an extension language — simply because 
there is none. Dually, the embedding of the rule language into the host language Scala 
is equally trivial because rules, atoms and interpretations are Scala objects, too. 

It is this “closed loop” that makes an aggregation operator (collect) possible that 
returns a list of Scala objects as specified by the programmer, e.g., a list of terms or 
atoms.⁴ This list can be further analysed or manipulated by the rules. See the description 
logic embedding in Section 5, which critically depends on this feature. This introspection 
capability stands out in comparison to the logic programming systems mentioned in the 
introduction. For instance, aggregation in systems like DLV [1] and IDP is limited 
to predefined integer-valued aggregates for sum, count, times, max and min. 

Most logic programming systems can be called from a (traditional) host program- 
ming language and can call external systems or utilize libraries for data structures. The 
DLV system, for instance, interfaces with C++ and Python [22], Prova with Java, 
and IDP with the Lua scripting language. Systems based on grounding (e.g., DLV and 
IDP) face the problem of “value invention” by external calls, i.e., having to deal with 
terms that are not part of the input specification [10]. 

The main issue, however, from the Fusemate perspective is that these systems’ 
external interfaces are rather heavy-handed (boilerplate code, mapping logic terms 
to/from the host language, String representation of logic programs) and/or limited to 
a predefined set of data structures. In contrast, Fusemate’s seamless integration with 
Scala encourages a more integrated and experimental problem solving workflow. The 
following Scala program demonstrates this point with the traffic light example: 


List("2020-07-02T10:00:00,1,green", .., "2020-07-02T10:02:15,2,red") 
  .map { _.split(",") }  // Split CSVs into triples, represented as Java arrays 
  .map {                 // Convert String triple to positive Change literals 
    case Array(date,id,color) => Change(LocalDateTime.parse(date), id.toInt, color) } 
  .saturate { rules }    // saturate is the Fusemate call, computes all models of the rules 
  .head                  // Select the first model 
  .toList                // Convert to Scala List because we want to sort elements by time: 
  .sortBy { _.time } 
  .flatMap {             // Analyze literals in model and retain only Faulty ones as CSV 
    case Faulty(time, id, since) => List(s"$time, $id, $since") 
    case _ => List() } 


From a workflow perspective, this program integrates Fusemate as a list operator (on a 
list of Change instances) in an otherwise unremarkable functional program. 


4 Technically, this is possible because the current interpretation is available in the rule body 
through the parameter I (see the transformation example above). One could directly access I, 
e.g., as in CHOOSE(a: atom, I), MATCH(State(t,3,c), a), t>10, c !="red" 
For a more realistically sized experiment I tried a combined Fusemate/Scala workflow 
for analysing the data of the DEBS 2015 Grand Challenge.⁵ The data comprises 
two million taxi rides in New York City in terms of start/end times, and start/end GPS 
coordinates, among others. The problem considered was to detect anomalies where a 
taxi driver drives away from a busy hotspot without a passenger. Solving the problem 
required clustering locations by pickup/drop-off activity for determining hotspots, and 
then analysing driver behavior given their pickups/drop-offs at these hotspots. 

Two million data points were too much for Fusemate alone and required Scala
preprocessing, e.g., for filling a grid abstraction of New York coordinates, data
cleansing, and filtering out drivers with little activity. Fusemate was used for
computing clusters with rules similar to transitive closure computation. The inputs
to the Fusemate calls were point clouds precomputed in Scala. The computed clusters
were then used to analyze taxi rides, prefiltered in Scala, for anomaly detection.
This involved three moderately complex rules, for first identifying gaps and then
analysing them. The comprehension operator was useful to find “the most recent ride
predating a given start”, among others. The longest Fusemate run took 0.31 sec for
64 rides (with 39 clusters fixed); most other runs took less than 0.15 sec.
Fusemate’s performance was perfectly acceptable in this experiment thanks to the
combined workflow.


5 Embedding Description Logic ALCIF 


ALCIF is the well-known description logic ALC extended with inverse roles and
functional roles. (See [3] for background on description logics.) This section describes
how to translate an ALCIF knowledge base to Fusemate rules and facts for
satisfiability checking.

This is our example knowledge base, TBox on the left, ABox on the right: 


Person ⊑ Rich ⊔ Poor           Anne : Person ⊓ Poor
Person ⊑ ∃father.Person        (Anne, Fred) : father
Rich ⊑ ∀father⁻¹.Rich          Bob : Person
Rich ⊓ Poor ⊑ ⊥                (Bob, Fred) : father


The father role is declared as functional, i.e., as a right-unique relation, and father⁻¹
denotes its inverse “child” relation. The third GCI says that all children of a rich father
are rich as well. In all models of the knowledge base Fred is Poor. This follows from the
given fact that his child Anne is poor, functionality of father and the third GCI. However,
there are models where Bob is Rich and models where Bob is Poor.

Translating description logic into rule-based languages has been done in many ways,
see e.g. [20,17,14,11]. An obvious starting point is taking the FOL version of a given
knowledge base. Concept names become unary predicates, role names become binary
predicates, and GCIs (general concept inclusions) are translated into implications. By
polynomial transformations, the implications can be turned into clausal form (if-then
rules over literals), except for existential quantification in a positive context, which
causes unbounded Skolem terms in derivations when treated naively (for example, the
second GCI above is problematic in this sense). This is why many systems and also the
transformation to Fusemate below avoid Skolemization.

5 http://www.debs2015.org/call-grand-challenge.html

The first GCI corresponds to the clause Person(x) → Rich(x) ∨ Poor(x), and the
second corresponds to the “almost” clause Person(x) → ∃y.(father(x, y) ∧ Person(y)).
Fusemate works with the reified rule versions of these, with an IsA-predicate for concept
instances, and a HasA-predicate for role instances. For the whole TBox one obtains the
following, where RN stands for “role name” and CN stands for “concept name” (see footnote 6):


IsA(x, Exists(RN("father"), CN("Person")), time) :- IsA(x, CN("Person"), time)
IsA(x, CN("Rich"), time) OR IsA(x, CN("Poor"), time) :- IsA(x, CN("Person"), time)
IsA(x, Forall(Inv(RN("father")), CN("Rich")), time) :- IsA(x, CN("Rich"), time)
FAIL :- IsA(x, CN("Poor"), time), IsA(x, CN("Rich"), time)

functionalRoles = Set(RN("father"))


Every GCI can be converted into rules like the above without problems. For that,
starting from its NNF, ∃-quantifications in the premise of a rule can be expanded in
place, and ∀-quantifications can be moved to the head as the ∃-quantification of the
NNF of the negated formula. Similarly for negated concept names. See for such
transformation methods. The ABox is represented similarly. Its first element, for instance,
is IsA(Name("Anne"), And2(CN("Person"), CN("Poor")), 0).

In addition, some more general “library” rules for the tableau calculus are needed: 


 1 IsA(x, c1, time) AND IsA(x, c2, time) :- IsA(x, And2(c1, c2), time)
 2 IsA(x, c1, time) OR IsA(x, c2, time) :- IsA(x, Or2(c1, c2), time)
 3 // Expansion rules for quantifiers
 4 IsA(y, c, time) :- Neighbour(x, r, y, time), IsA(x, Forall(r, c), time)
 5 HasA(x, r, rSuccOfx, time+1) AND IsA(rSuccOfx, c, time+1) : @preds("TimePlus1") :-
 6   IsA(x, Exists(r, c), time), ! (functionalRoles contains r),
 7   NOT( Neighbour(x, r, y, time), IsA(y, c, time) ), NOT( Blocked(x, _, time) ),
 8   LET(rSuccOfx: Individual, Succ(r, x))
 9
10 HasA(x, r, rSuccOfx, time+1) AND IsA(rSuccOfx, c, time+1) : @preds("TimePlus1") :-
11   IsA(x, Exists(r, c), time), functionalRoles contains r,
12   NOT( Neighbour(x, r, y, time) ), NOT( Blocked(x, _, time) ),
13   LET(rSuccOfx: Individual, Succ(r, x))
14
15 IsA(y, c, time) :-
16   IsA(x, Exists(r, c), time), functionalRoles contains r,
17   Neighbour(x, r, y, time)


The expansion rules on lines 1 and 2 deal with the ALC binary Boolean connectives
And2 and Or2 in the obvious way. Supposing NNF of embedded formulas, no other
cases can apply. The remaining rules can be understood best with the standard tableau
algorithm for ALCIN in mind, which includes blocking to guarantee termination.
They follow the terminology in [6, Chapter 4]. The Neighbour relation abstracts from
the HasA relation; its definition is omitted for space reasons. The expansion rule for ∃
comes in three cases. The first case (line 5), for example, applies to non-functional
roles as per the Scala builtin test on line 6. The expansion of the given ∃-formula only
happens if it is not yet satisfied and in a non-blocked situation (line 7). In this case
the rule derives a Skolem object defined on line 8 for satisfying the ∃-formula. Notice
the annotation @preds("TimePlus1") which makes sure that the head is on the highest
stratum. This way, the rule will be applied after, in particular, the rules for blocking.
Furthermore, with the time stamp time+1 the Skolem object is kept separate from the
computations in the current iteration time. The blocking rules are defined as follows:

6 See the Fusemate web page for the full, runnable code.


// Collect all concepts that an individual x isA, at a given time
Label(x, cs.toSet, time) :- IsA(x, _, time), COLLECT(cs: List[Concept], c STH IsA(x, c, time))
// Ancestor relation of Skolem objects introduced by exists-right
Anc(x, Succ(r, x), time) :- HasA(x, r, Succ(r, x), time)
Anc(x, Succ(r, z), time) :- HasA(z, r, Succ(r, z), time), Step(time, prev), Anc(x, z, prev)
// Blocked case 1: y is blocked by some individual x
Blocked(y, x, time) :- Label(y, yIsAs, time), Anc(x, y, time), Label(x, xIsAs, time), yIsAs == xIsAs
// Blocked case 2: y is blocked by some ancestor
Blocked(y, x, time) :- Anc(x, y, time), Blocked(x, _, time)

Some additional rules are needed for dealing with basic inconsistencies and for
carrying over IsA and HasA facts between iterations. They are not shown here.

The expansion rules and blocking rules follow the tableau calculus description in
[6, Chapter 4]. One important detail is that the expansion rule for ∃ must be applied
with lowest priority. This is straightforward thanks to Fusemate’s stratification and
aggregation construct. Equally important is the access to (Scala) data structures via
built-ins and using them as terms of the logical language. This made it easy to program
Skolemization and the Label relation for collecting sets of concepts of an individual.


6 Conclusions 


This paper described recent developments around the Fusemate logic programming 
system. It included new technical improvements for a weaker form of stratification, which 
enabled useful aggregation and comprehension language constructs. It also argued for 
the advantages of the tight integration with Fusemate’s host language, Scala, in terms 
of data structures and usability. 

Answer set solvers like DLV and SModels are designed to solve NP-complete or
higher complexity search problems as fast as possible. Fusemate is not intended to
compete with such systems. Instead, it is motivated by “well-behaved” knowledge
representation applications, similarly to description logic reasoners, whose (often)
NExpTime-complete solving capabilities are not expected to be typically needed.
(Some more work is needed, though, e.g., on improving the current term indexing
techniques to speed up model computation.) More specifically, the main intended
application of Fusemate is the runtime analysis of systems that evolve over time.
The taxi rides data experiment described earlier is an example of that. It suggests
that Fusemate is currently best used in a combined problem solving workflow if
scalability is an issue.

As for future work, the next steps are to make the description logic reasoner of
Section 5 callable from within Fusemate rules in a DL-safe way and to embed a
temporal reasoning formalism. The event calculus seems to be a good fit.


Acknowledgements. I am grateful to the reviewers for their helpful comments. 


Abstract. Twee is an automated theorem prover for equational logic. It
implements unfailing Knuth-Bendix completion with ground joinability
testing and a connectedness-based redundancy criterion. It came second
in the UEQ division of CASC-J10, solving some problems that no other
system solved. This paper describes Twee’s design and implementation.


1 Introduction


Twee is an automated theorem prover for equational logic, available as open-
source software [17]. It features good performance (coming second in the UEQ
division of CASC-J10), low memory use, and human-readable proof output. 
Twee’s general architecture is quite traditional: it uses a DISCOUNT loop 
implementing unfailing Knuth-Bendix completion [3]. However, it has a few 
characteristics which are unusual in a high-performance theorem prover: 


Fixed heuristics. Twee does not adjust its strategy based on the input problem.
It uses a fixed term order, a fixed critical pair scoring function, and so on. Rather
than detecting the kind of problem, Twee uses general-purpose strategies that
work for all sorts of problems (Section 2).


Strong redundancy tests. Rather than using special strategies for associative-
commutative functions, Twee builds in strong redundancy tests, based on ground
joinability and connectedness (Section 3). These handle not just AC functions
but many kinds of unorientable equations, in particular permutative ones (where
both sides are almost the same but with variables in a different order).


A high-level language. Twee consists of 5300 lines of Haskell code, whereas for 
example Waldmeister is 65000 lines of C. As such, it is easy to experiment 
with. Despite the choice of programming language, Twee is quite fast at raw 
deduction steps, thanks to careful coding of low-level term operations (Section 4). 

Despite the fixed heuristics and high-level language, Twee comes close in
performance to E [14] and Waldmeister [12]. It is strong in many problem classes,
including LAT (lattices) and REL (relation algebra) from TPTP, which feature
many commutative operators where Twee’s redundancy tests shine, and on
unusual problems, where no prover has special heuristics. Twee is, however, poor
at RNG (rings), where it seems important to choose a good term order. The rest of
the paper describes Twee’s design in detail, focusing on the three aspects above.


Notation. We use t ≡ u to mean that t and u are syntactically equal.


2 Architecture 


Twee natively supports only unit equality problems with ground goals, but the
frontend also supports arbitrary quantification, Horn formulas, and many-sorted
logic. These features are eliminated using the external tool Jukebox [16], which:


— Clausifies the problem to eliminate conjunction and quantifiers. 
— Encodes Horn clauses as equations [5]. 
— Encodes sorts using extra functions [4]. 


At this point, the goal can still contain existentially-quantified variables, which
must be eliminated. To do so, we use an old trick, also used by Waldmeister: if
the goal t = u is non-ground, we add new function symbols eq, true and false,
and two axioms ∀X. eq(X, X) = true and eq(t, u) = false, and replace the goal
with true = false. Now we have a unit equality problem with a ground goal.
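
The encoding of non-ground goals can be sketched in a few lines of Python. This is only an illustration: the tuple-based term representation and the name `encode_goal` are assumptions made here, not part of Twee or Jukebox, and the symbols eq, true and false are assumed not to occur in the input problem.

```python
# Terms are nested tuples ("f", arg1, ...); constants are 1-tuples such as ("a",).
# Variables are plain strings. This representation is an assumption of the sketch.

def is_var(t):
    return isinstance(t, str)

def is_ground(t):
    if is_var(t):
        return False
    return all(is_ground(a) for a in t[1:])

def encode_goal(axioms, goal):
    """Return (axioms, goal) such that the returned goal is ground.

    axioms is a list of (lhs, rhs) pairs and goal is a single (lhs, rhs) pair.
    """
    t, u = goal
    if is_ground(t) and is_ground(u):
        return list(axioms), goal          # nothing to do
    true, false = ("true",), ("false",)
    extra = [
        (("eq", "X", "X"), true),          # forall X. eq(X, X) = true
        (("eq", t, u), false),             # eq(t, u) = false
    ]
    return list(axioms) + extra, (true, false)
```

Proving true = false from the extended axioms then amounts to finding an instance of the original goal in which both sides become equal.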

The main proof loop is shown in Algorithm 1. It implements unfailing completion
[3] using a DISCOUNT loop [7]. The state consists of R, a set of rewrite
rules and unorientable equations (the active set, initially empty); Q, the set of
unprocessed critical pairs formed from R (the passive set, initially containing all
the axioms); J, a set of ground joinable equations used for subsumption checking
(following [1]); and the goal. The main loop removes the best critical pair from Q
(see below), and if it is not redundant, adds it to R (oriented if possible) and adds
all its critical pairs to Q. Every so often, the rules in R are reduced with respect
to one another and redundant rules are removed. The goal is kept normalised
with respect to R and the prover succeeds if the goal becomes trivial (see footnote 1).

The passive set is normally quadratic in the size of the active set: typical 
numbers are |R| ~ 10,000 and |Q| ~ 10,000,000. Hence we must process each 
passive critical pair at high speed, but can spend time on each new rewrite rule. 


Term ordering. We always use KBO, with all functions having weight 1, and
ordered so that more frequently occurring functions are smaller.
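
As an illustration of this term order, here is a minimal Python sketch of KBO in which every function symbol and every variable has weight 1, and the precedence is derived from symbol frequency. The term representation, the function names, and the tie-breaking by symbol name are assumptions of this sketch, not a description of Twee's code.

```python
from collections import Counter

# Terms: ("f", arg1, ...) for function applications, plain strings for variables.

def is_var(t):
    return isinstance(t, str)

def symbols(t):
    """Yield every function symbol occurrence in t."""
    if not is_var(t):
        yield t[0]
        for a in t[1:]:
            yield from symbols(a)

def var_counts(t):
    if is_var(t):
        return Counter([t])
    counts = Counter()
    for a in t[1:]:
        counts += var_counts(a)
    return counts

def weight(t):
    # All symbols weigh 1, so the weight of a term is its size.
    return 1 if is_var(t) else 1 + sum(weight(a) for a in t[1:])

def precedence(terms):
    """More frequent functions are smaller (ties broken by name: an assumption)."""
    freq = Counter(s for t in terms for s in symbols(t))
    ranked = sorted(freq, key=lambda s: (-freq[s], s))
    return {s: i for i, s in enumerate(ranked)}

def kbo_greater(s, t, prec):
    """True if s > t holds in KBO for every ground instance."""
    vs, vt = var_counts(s), var_counts(t)
    if any(vs[x] < n for x, n in vt.items()):
        return False                       # variable condition fails
    ws, wt = weight(s), weight(t)
    if ws != wt:
        return ws > wt
    if is_var(s) or is_var(t):
        return False                       # equal weight with a variable: not greater
    if s[0] != t[0]:
        return prec[s[0]] > prec[t[0]]
    for a, b in zip(s[1:], t[1:]):         # same head: compare arguments left to right
        if a != b:
            return kbo_greater(a, b, prec)
    return False
```

With this order, associativity can be oriented as (x + y) + z → x + (y + z), while commutativity x + y = y + x cannot be oriented in either direction.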


Critical pair selection. When a critical pair is added to Q, it is first normalised 
and then assigned a score; the proof loop selects the critical pair with the lowest 
score. The score function’s job is to pick out promising critical pairs, and the 
choice of score function can make or break the prover. However, as it is applied 
to every critical pair, it also needs to be fast. We compute scores as follows: 


1 An equation is considered trivial if it is of the form t = t. 

Algorithm 1 The main proof loop

(R, J, Q) := (∅, ∅, axioms)
while Q ≠ ∅ do
  P := remove lowest-scoring element of Q
  if P’s parent rules are still present in R then
    normalise P using R to get t = u
    if t ≢ u and t = u is not connected and t = u is not subsumed by J then
      if t = u is ground joinable then
        add t = u to J
      else
        orient t = u and add it to R
        for all critical pairs cp of t = u and R do
          normalise cp using only the oriented rules in R
          if cp is non-trivial then add cp to Q end if
        end for
        normalise goal using R
        if goal is trivial then return “theorem” end if
        simplify rules in R wrt each other, but limit this step to 5% of total runtime
      end if
    end if
  end if
end while
return “countersatisfiable”


— We start with a weighted sum of the size of the two terms. By default we take
4 · weight(t) + weight(u), where t is the bigger term and u the smaller. In other
words, the size of the bigger term is most important. Variables are weighted
slightly less than function symbols, to encourage finding more general rules.

— To encourage Twee to use all the axioms, we add the critical pair’s depth, 
where axioms have depth 0, critical pairs of the axioms have depth 1, etc. 

— If a term contains the same subterm multiple times, only one occurrence of 
that subterm is counted; the other occurrences get a nominal weight of one 
symbol. In effect we measure the weight as if the term was a DAG rather 
than a tree. The idea is that identical subterms form the same critical pairs, 
and tend to get rewritten at the same time: they come and go together. 

— Finally, any critical pair of the form eq(v, w) = false (where eq is the function 
used to encode existential goals) with v and w unifiable is given a fixed cost 
of 1, because selecting it will immediately prove the goal. This trick is also 
used by Waldmeister, and is vital in practice for existential goals. 
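
The first three ingredients of the score can be sketched as follows. The concrete symbol weights (4 for functions, 3 for variables), the choice of "bigger" by weight, and all names are assumptions made for illustration; the text only states that variables weigh slightly less than function symbols. The special case for eq(v, w) = false is omitted because it needs a unification routine.

```python
FUN_WEIGHT = 4   # assumed value
VAR_WEIGHT = 3   # assumed value, "slightly less" than a function symbol

# Terms: ("f", arg1, ...) for function applications, plain strings for variables.

def is_var(t):
    return isinstance(t, str)

def dag_weight(t, seen=None):
    """Weight of t where a repeated subterm is charged as a single symbol."""
    if seen is None:
        seen = set()
    if is_var(t):
        return VAR_WEIGHT
    if t in seen:
        return FUN_WEIGHT                  # nominal weight of one symbol
    seen.add(t)
    return FUN_WEIGHT + sum(dag_weight(a, seen) for a in t[1:])

def score(lhs, rhs, depth):
    """Lower is better: 4 * bigger side + smaller side + depth of the critical pair."""
    big, small = sorted((dag_weight(lhs), dag_weight(rhs)), reverse=True)
    return 4 * big + small + depth
```

For example, f(g(X), g(X)) is charged 15 rather than the 18 that a plain tree traversal would give, because the second occurrence of g(X) counts as one symbol.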


Proof production and checking. Twee uses an LCF-style kernel [9] to guarantee 
soundness. Every member of the active set comes with a proof object, which 
is verified by a trusted proof checker (consisting of about a page of code). The 
proofs are low-level and thus easy to check: the only proof steps allowed are 
reflexivity, symmetry, transitivity, congruence and applying an axiom or lemma. 
It is not possible to add a rule to the active set without supplying a proof, and
any invalid proof step causes a fatal runtime error. The key to making this fast
is that only the active set, not the passive set, includes proof objects.
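
To give an impression of how small such a checker can be, here is a Python sketch that accepts exactly the five kinds of proof step listed above. The encoding of proofs as tagged tuples and the function names are hypothetical; Twee's actual kernel is written in Haskell and may differ in every detail.

```python
# Terms: ("f", arg1, ...) for function applications, plain strings for variables.
# Proofs (an assumed encoding):
#   ("refl", t)                 proves t = t
#   ("symm", p)                 proves b = a if p proves a = b
#   ("trans", p, q)             proves a = c if p proves a = b and q proves b = c
#   ("cong", f, [p1, ..., pn])  proves f(a1..an) = f(b1..bn) if pi proves ai = bi
#   ("axiom", name, sigma)      proves the instance of a named axiom or lemma

def is_var(t):
    return isinstance(t, str)

def subst(t, sigma):
    """Apply the substitution sigma (a dict from variables to terms) to t."""
    if is_var(t):
        return sigma.get(t, t)
    return (t[0],) + tuple(subst(a, sigma) for a in t[1:])

def check(proof, axioms):
    """Return the equation (lhs, rhs) established by proof, or raise ValueError."""
    kind = proof[0]
    if kind == "refl":
        return proof[1], proof[1]
    if kind == "symm":
        a, b = check(proof[1], axioms)
        return b, a
    if kind == "trans":
        a, b = check(proof[1], axioms)
        b2, c = check(proof[2], axioms)
        if b != b2:
            raise ValueError("transitivity: middle terms differ")
        return a, c
    if kind == "cong":
        f, subproofs = proof[1], proof[2]
        sides = [check(p, axioms) for p in subproofs]
        return ((f,) + tuple(l for l, _ in sides),
                (f,) + tuple(r for _, r in sides))
    if kind == "axiom":
        lhs, rhs = axioms[proof[1]]
        return subst(lhs, proof[2]), subst(rhs, proof[2])
    raise ValueError(f"unknown proof step: {kind!r}")
```

A caller that only ever adds an equation to the active set after `check` has returned that same equation gets the LCF-style guarantee described above.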

Once the goal is proved, we transform the proof object into a human-readable 
proof, consisting of a flat sequence of rewrite steps. We also introduce lemmas, 
to avoid exponentially-sized proofs: any active rewrite rule is a candidate lemma. 
Our approach is similar to [8], but simpler as our proof steps are smaller; but 
their lemma selection strategy is smarter than ours and produces fewer lemmas. 


Goal transformation. Twee’s frontend can optionally transform the problem to 
make the prover more goal-directed. The transformation is simple, but strange. 
For every function term f(...) appearing in the goal, we introduce a fresh
constant symbol a and add the axiom f(...) = a. For example, if the goal is
f(g(a), b) = h(c), we add the axioms f(g(a), b) = d1, g(a) = d2, and h(c) = d3.
Simplification will rewrite the first axiom to f(d2, b) = d1 and the goal to d1 = d3.

By doing this transformation, (1) any subterm of the goal gets normalised to
a constant, so critical pairs containing goal terms get a lower score, and (2) new
critical pairs involving these constants appear, which are likely to be relevant to
the goal. We evaluate this transformation in Section 5.
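
The goal transformation can be sketched as a pre-order traversal of the (ground) goal. The term representation, the function name, and the naming scheme d1, d2, ... for fresh constants are assumptions of this sketch; following the example in the text, constants are left alone and only terms with arguments receive a name.

```python
# Terms: ("f", arg1, ...) for function applications; constants are 1-tuples.

def goal_axioms(goal, prefix="d"):
    """Return axioms (subterm, fresh constant) for each compound subterm of the goal."""
    names = {}                             # subterm -> fresh constant, in pre-order

    def visit(t):
        if isinstance(t, str) or len(t) == 1:
            return                         # variables and constants are left alone
        if t not in names:
            names[t] = (f"{prefix}{len(names) + 1}",)
        for a in t[1:]:
            visit(a)

    for side in goal:
        visit(side)
    return list(names.items())
```

Applied to the goal f(g(a), b) = h(c), this yields the three axioms from the example, in the same order.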


Weak rewrite rules. Completion sometimes deduces equations where both sides
have a variable not occurring on the other side, such as f(x, y) = g(x, z). Such
equations are awkward for rewriting: suppose we want to use this equation to
rewrite the term f(t, u). What value should we choose for z?

Twee splits this equation into nicely-behaved rewrite rules instead. To do so,
we introduce the concept of a weak rewrite rule. A weak rewrite rule t ⇝ u is
like an ordinary rewrite rule, except that it only satisfies t ≥ u, not t > u
(see footnote 2). Weak rewrite rules form critical pairs and participate in
rewriting just like any other rewrite rule, except that to ensure termination, we
may only perform the rewrite step tσ ⇝ uσ if tσ ≢ uσ, i.e. tσ and uσ are
syntactically different terms (see footnote 3).

Using weak rewrite rules, Twee splits f(x, y) = g(x, z) into the two rules
f(x, y) → g(x, ⊥) and g(x, z) ⇝ g(x, ⊥), where ⊥ is the minimal term in the
term ordering. Note that g(x, z) ⇝ g(x, ⊥) is a valid weak rewrite rule because
g(x, z) ≥ g(x, ⊥), with equality exactly when z = ⊥.

As another example, the equation f(x, x, y, z) = g(x, y, y, w) is split into
f(x, x, y, ⊥) = g(x, y, y, ⊥), f(x, x, y, z) ⇝ f(x, x, y, ⊥) and g(x, y, y, w) ⇝
g(x, y, y, ⊥). In this case, we are still left with an unorientable rule afterwards,
but since it has the same variables on both sides it is unproblematic for rewriting.

It is always possible and safe to split an equation into an equivalent set of:

— ordinary rewrite rules t → u with t > u,
— weak rewrite rules t ⇝ u with t ≥ u, and
— unorientable equations t = u where both sides have the same set of variables.


Twee does this whenever an equation is about to be added to R. 


2 t ≥ u means: for all grounding substitutions σ, either tσ > uσ or tσ = uσ.
3 This is different from e.g. constrained rewriting: we can perform the rewrite even if t
and u are unifiable, as long as they are not the same term right now.


3 Redundancy Criteria


The basic redundancy criterion of Knuth-Bendix completion is joinability: a 
critical pair can be discarded if both sides normalise to the same term. Joinability 
runs into problems when we have unorientable equations. For example, consider 
a rewrite system for an associative-commutative operator “+”: 


x + y = y + x                  (1)
(x + y) + z → x + (y + z)      (2)
x + (y + z) = y + (x + z)      (3)


From (1) and (2) we get the critical pair x + (y + z) ← (x + y) + z → z + (x + y),
which cannot be rewritten any further so it is not joinable. However, the critical 
pair is redundant, because the above rewrite system is ground confluent. We 
would like to detect redundant but non-joinable critical pairs. 

This section presents the redundancy criteria that Twee uses to handle
unorientable equations: our take on the well-known approach of ground joinability
testing [6], and a novel (we believe) approach based on connectedness [2]. Unlike
the standard techniques for associative-commutative functions [1], our criteria
handle any kind of permutative equation; we evaluate our approach in Section 5.


3.1 Ground Joinability Testing 


Although the critical pair x + (y + z) ← (x + y) + z → z + (x + y) is not joinable,
all ground instances of it are joinable, and we say that the critical pair is ground
joinable. For example, the instance a + (b + c) ← (a + b) + c → c + (a + b), with
a < b < c, can be joined by the rewrite sequence
c + (a + b) → a + (c + b) → a + (b + c). Any
ground joinable critical pair is redundant.
Martin and Nipkow [13] suggest an approach for checking ground joinability: 


— Consider all possible orderings between the variables of the critical pair, such
as x < y < z, y < x = z, x = y < z, and so on.

— For each ordering, show that the critical pair is joinable when the variables
have that order. Formally, this means showing that all ground instances
satisfying the ordering are joinable. For example, the rewrite proof above
shows our critical pair joinable for any ground instance satisfying x < y < z.


Their algorithm effectively does a case analysis on all possible variable order- 
ings, but it is inefficient because there are so many possible orderings. 
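
To see how quickly the number of cases grows, the following Python sketch enumerates every ordering of a set of variables that can be written with < and =. The function names and the block-list representation are assumptions made for illustration. There are 13 such orderings for three variables and 75 for four.

```python
def weak_orderings(variables):
    """Yield every ordering of the variables built from < and =.

    An ordering is a list of blocks. Variables inside one block are equal,
    and the blocks are listed from smallest to largest.
    """
    variables = list(variables)
    if not variables:
        yield []
        return
    first, rest = variables[0], variables[1:]
    for order in weak_orderings(rest):
        # Either make `first` equal to the variables of an existing block ...
        for i in range(len(order)):
            yield order[:i] + [order[i] + [first]] + order[i + 1:]
        # ... or give it a block of its own at any position.
        for i in range(len(order) + 1):
            yield order[:i] + [[first]] + order[i:]

def show(order):
    return " < ".join(" = ".join(sorted(block)) for block in order)
```

Calling `sorted(show(o) for o in weak_orderings("xyz"))` lists all 13 cases for the example critical pair, including `x < y < z` and `x = z < y`.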

Our algorithm is similar, but tries to minimise the number of cases it considers.
It does so by allowing orderings that: (1) constrain only a subset of the variables,
such as x < y, and (2) use ≤, as in x ≤ y ≤ z. It works as follows:


1. Choose a strict total order on all the variables, using only <; e.g., x < y < z.
2. Show that the critical pair is joinable under that ordering. Formally, we show 
that all ground instances satisfying the ordering are joinable. 




3. We have now shown that the critical pair is joinable in one specific case. Now
generalise that case, by: (1) removing variables from the ordering, and (2)
replacing < with ≤ in the ordering, as long as the critical pair is joinable
under the resulting ordering. (This may e.g. generalise x < y < z to x ≤ z.)

4. Repeat, but pick an ordering that is not covered by any of the cases so far. 

5. When all variable orderings involving only < have been covered, all the ones 
that remain must involve =. For each such ordering, take the critical pair, 
unify all equal variables, and recursively call the ground joinability check. 


Example. Take the critical pair x + (y + z) ← (x + y) + z → z + (x + y) and suppose
that we choose the ordering x < y < z. It can be joined when this order holds, as
for any instance where x < y < z, we have z + (x + y) → x + (z + y) → x + (y + z).

Having joined the critical pair in one case, we now generalise the case. We
first try to remove each variable in turn, i.e. to join the critical pair in the three
cases x < y, y < z, and x < z in turn. None of these attempts succeeds.

Now we try replacing a < with a ≤, to get x < y ≤ z. We must check if all
ground instances satisfying x < y ≤ z are joinable, but how? We might think of
splitting this into two cases x < y < z and x < y = z, but instead we are going
to find one rewrite proof that works for both.

Consider the rewrite proof above. In it, the step x + (z + y) → x + (y + z) is
fine if y < z, but does not seem to be allowed if y = z. But in fact it is fine: if
y = z, the terms x + (z + y) and x + (y + z) are identical, so this rewrite step does
nothing and can just be dropped. That is, the proof works both when x < y < z
and x < y = z, and shows joinability for the case x < y ≤ z. We generalise the
other < similarly, showing that the critical pair is joinable in the case x ≤ y ≤ z.

Next, we pick another total order on the variables, but not one in which
x ≤ y ≤ z. We might pick, for example, z < y < x. The process repeats: we show
ground joinability under this ordering, and generalise it to z ≤ y ≤ x. We repeat
until all cases are covered, and the ground joinability test succeeds.

Although our algorithm can be expensive in theory, in practice it needs to
consider only a few orderings, and a small number of variables. Step (5) can
occasionally be expensive, but by generalising < to ≤ we can usually avoid it.


The general case. Here is how we test joinability under a given variable ordering.
First, we parameterise our term order. Given an ordering C, we define t ≥_C u to
mean that, for all grounding substitutions σ, if σ satisfies C then tσ ≥ uσ.

In the example, we weakened a < to a ≤. To do so, we used a rewrite step
that, in some ground instances, rewrote a term to the same term. To allow these
kinds of steps, we loosen our definition of rewriting: we may perform a rewrite
t → u under C as long as t ≥_C u and t ≢ u. Rewriting terminates because given
a rewrite proof t ≥_C u ≥_C v ≥_C ..., there is always a ground instance σ where
tσ > uσ > vσ > ..., since C was constructed as a strict order in step (1).

With this definition, normalising z + (x + y) using the ordering C := x ≤ y ≤ z
yields z + (x + y) → x + (z + y) → x + (y + z), where e.g. the first step is allowed
because z + x ≥_C x + z and z + x ≢ x + z. Thus we can join our example critical
pair under a given variable ordering just by normalising both sides, as we want.

The last ingredient is to implement a test for t ≥_C u, which we have done for
KBO. The tricky part is checking whether weight(t) ≥ weight(u), which can be
solved by taking the expression weight(t) − weight(u), a linear combination of
the weights of t’s and u’s variables, and computing its minimum possible value.

One nice property is that the rest of the ground joining code is independent 
of the term order. To support e.g. LPO, one just needs to implement >c for it. 


Why not allow arbitrary ordering constraints? Some critical pairs can only be 
ground joined by using ordering constraints on arbitrary terms (e.g. x + y < z). 
We do not support these, as they make everything enormously more complex: 


— The number of possible orderings becomes infinite. You can get stuck enu- 
merating more and more cases of a case split which never ends. In our design, 
there are finitely many orderings and the algorithm clearly terminates. 

— Computing ≥_C for KBO becomes NP-complete [10]. In our setting, it takes
polynomial time, and we expect it can be done in linear time following [11].


3.2 Connectedness 


Ground joinability testing is rather heavyweight, constructing and analysing a 
sometimes large case split, and sometimes it fails because it only supports case 
splits on variables. Twee also supports a simpler, complementary method that 
works well when an unorientable equation is applied under another function. 

The method makes use of connectedness. A critical pair s ← t → u is connected
if there is a rewrite proof s = t_1 ↔ ... ↔ t_n = u such that each t_i is strictly less
than t [2]. In Knuth-Bendix completion, any connected critical pair is redundant.
In other words, when joining s ← t → u, we can do rewrite steps that increase
the term, as long as the result is always strictly less than t.

Here is how we use connectedness. Let σ be a substitution that grounds s
and u. When joining s ← t → u, we may want to perform a rewrite step v → w
using an unoriented equation, but we don’t know if v > w. We allow the rewrite
step v → w as long as: (1) w < t, and (2) vσ > wσ. Condition (1) ensures
connectedness, and condition (2) ensures that rewriting eventually terminates.

For example, suppose we take the earlier rules for “+” and add a function f: 


f(x + y, z + w) → f(x, f(z, f(y, w)))   (4)
f(x, f(y, z)) = f(y, f(x, z))           (5)


Assume KBO with both f and + having weight 1. One critical pair is
f(y, f(z, f(x, w))) ←(4) f(y + x, z + w) ↔(1) f(x + y, z + w) →(4) f(x, f(z, f(y, w))).
We can show this to be connected using σ = {x ↦ a, y ↦ b, z ↦ c, w ↦ d},
a < b < c < d. The left term f(y, f(z, f(x, w))) rewrites to f(y, f(x, f(z, w)))
using (5), because f(y, f(x, f(z, w))) < f(x + y, z + w) (connectedness) and
f(b, f(c, f(a, d))) > f(b, f(a, f(c, d))) (termination); and that rewrites to
f(x, f(y, f(z, w))) similarly. The right term f(x, f(z, f(y, w))) also rewrites to
f(x, f(y, f(z, w))). Thus the critical pair is redundant.

In general we try two choices of σ: one where the first variable in s = u is
mapped to a_1, the second to a_2, and so on (with a_1 < ... < a_n); and another
where the variables are mapped in reverse order. The critical pair is redundant
if either choice of σ works. This is not a principled choice (most likely, some
critical pairs need a different σ), but we do not know how to find the “best” σ.


4 Implementation 


Twee consists of 5300 lines of Haskell code, comprising: terms, unification etc. 
(1150 lines); the frontend (850 lines); proof output (700 lines); general data 
structures (700 lines); the main proof loop (600 lines); joining, ground joining 
and connectedness (500 lines); critical pairs and the passive set (400 lines); term 
indexing (250 lines); and KBO (150 lines). This does not include TPTP parsing,
clausification, etc., which are provided by the 4000-line Jukebox [16] program.
Most of Twee is written in a high-level, Haskell-idiomatic, somewhat inefficient 
style. Performance-critical parts (term manipulation, term indexing, and the 
passive set) are coded more carefully, and are described below. The bottleneck is 
usually normalising the many millions of critical pairs that are generated. 


4.1 Terms 


The simplest way to represent terms in Haskell, as trees, is not ideal: it creates 
pressure on the garbage collector, and core operations such as matching and 
unification become heavily recursive and needlessly slow. 

Instead, we represent terms as flatterms—the term is flattened into a list of 
symbols and stored in an array. In order to preserve the structure of the term, 
each symbol is paired with a number giving the size of the subterm rooted at 
that symbol. For example, the term f(x, g(x, y)) is represented as: 
f:5 | x:1 | g:3 | x:1 | y:1
where e.g. g : 3 indicates a subterm with root g that is 3 symbols long (g, x, y). 

In addition, each function and variable has an ID number, and the term stores 
those ID numbers, rather than a pointer to the function or variable. So, in the 
array above, the “f” really means the ID number of f. Functions have positive ID 
numbers, and variables negative, so they can be easily told apart, and there is a 
separate global array which maps ID numbers to functions. This design allows us 
to represent a term as a simple array of integers, so that pressure on the garbage 
collector is reduced. Also, comparing two terms for equality just amounts to a 
bytewise comparison of the arrays (a C memcmp). What’s more, by using array 
slicing, we can view a term’s subterms as flatterms in their own right. 
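
The flatterm layout can be illustrated with a small sketch. This is not Twee's Haskell code: it is a Python illustration under simplifying assumptions (symbols are kept as strings rather than integer IDs, and the names `flatten`, `subterm_at` and `children` are made up for this example). Note also that Python list slices copy, whereas the array slices described above share storage.

```python
from typing import List, Tuple, Union

# A tree term is a variable/constant name (str) or a pair (function, [arguments]).
Tree = Union[str, Tuple[str, list]]
# A flatterm is a preorder list of (symbol, size of the subterm rooted there).
Flat = List[Tuple[str, int]]

def flatten(t: Tree) -> Flat:
    """Flatten a tree term into preorder (symbol, subterm size) pairs."""
    if isinstance(t, str):            # a leaf is a subterm of size 1
        return [(t, 1)]
    f, args = t
    body: Flat = []
    for a in args:
        body.extend(flatten(a))
    return [(f, 1 + len(body))] + body

def subterm_at(flat: Flat, i: int) -> Flat:
    """The subterm rooted at index i is a contiguous slice of the array."""
    return flat[i:i + flat[i][1]]

def children(flat: Flat) -> List[Flat]:
    """Immediate subterms of the root, found by hopping over subterm sizes."""
    out, i = [], 1
    while i < len(flat):
        out.append(subterm_at(flat, i))
        i += flat[i][1]
    return out

# The running example f(x, g(x, y)).
flat = flatten(("f", ["x", ("g", ["x", "y"])]))
print(flat)              # the five (symbol, size) pairs of the example above
print(children(flat))    # the two arguments of f, each a flatterm itself
```

Because every subterm is a contiguous slice, equality of two terms reduces to comparing two flat sequences, which is the property the memcmp remark above relies on.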

On top of this we build a higher-level API. There are two types, terms and 
termlists, both implemented as flatterms. With the help of Haskell’s user-defined 
patterns, they are exposed to the user as ordinary algebraic datatypes. We can 
use normal pattern matching to e.g. check if a term is a function or variable, 
access its children (as a termlist), iterate through it a symbol or subterm at a 
time, etc. All these operations turn into a few machine instructions. Matching 
and unification are implemented using this API as efficient tail-recursive loops. 


4.2 Indexing 


Rewriting uses a perfect discrimination tree [15], including Waldmeister’s
refinements [12]. The implementation takes care not to create backtracking points 
unless needed. There is no unification index, since this is not usually a bottleneck. 
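
To make the idea of a perfect discrimination tree concrete, here is a deliberately naive Python sketch. It is not Twee's implementation: it backtracks freely, omits Waldmeister's refinements, and assumes a convention invented for this example (pattern variables are strings starting with `?`, every function symbol has a fixed arity). The tree is a trie over the preorder symbol sequence of each pattern; "perfect" here means that repeated variables are checked during retrieval, so every hit is a genuine match.

```python
def flatten(term):
    """Preorder list of (symbol, subterm size); a bare string is a leaf."""
    if isinstance(term, str):
        return [(term, 1)]
    f, args = term
    body = [pair for a in args for pair in flatten(a)]
    return [(f, 1 + len(body))] + body

def is_var(symbol):
    # Assumed convention for this sketch: pattern variables start with "?".
    return symbol.startswith("?")

class DiscTree:
    def __init__(self):
        self.children = {}   # edge symbol -> DiscTree
        self.values = []     # values of the patterns that end at this node

    def insert(self, pattern, value):
        node = self
        for symbol, _size in flatten(pattern):
            node = node.children.setdefault(symbol, DiscTree())
        node.values.append(value)

    def matches(self, term):
        """All (value, bindings) such that a stored pattern matches term."""
        return list(self._walk(flatten(term), 0, {}))

    def _walk(self, flat, i, subst):
        if i == len(flat):
            for value in self.values:
                yield value, dict(subst)
            return
        symbol, size = flat[i]
        for edge, child in self.children.items():
            if is_var(edge):
                # A variable edge consumes the whole subterm at position i.
                sub = tuple(flat[i:i + size])
                if edge not in subst:
                    subst[edge] = sub
                    yield from child._walk(flat, i + size, subst)
                    del subst[edge]
                elif subst[edge] == sub:      # repeated variable: must agree
                    yield from child._walk(flat, i + size, subst)
            elif edge == symbol:
                yield from child._walk(flat, i + 1, subst)

tree = DiscTree()
tree.insert(("f", ["?x", "?x"]), "rule1")
tree.insert(("f", ["?x", ("g", ["?y"])]), "rule2")
tree.insert(("f", ["a", "?y"]), "rule3")
print([v for v, _ in tree.matches(("f", ["a", ("g", ["b"])]))])
```

The subterm sizes stored in the flatterm are what allow a variable edge to skip an entire subterm in one step.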


4.3 The Passive Set 


Early versions of Twee often ran out of memory after about 30 minutes. The 
reason is the passive set—it grows quadratically in the number of active rules, 
because any pair of rules can have a critical pair. In typical prover runs it contains 
anywhere between a million and a hundred million critical pairs. 

Twee now uses a space-efficient passive set representation adapted from 
Waldmeister [12]. The main idea is to throw away all terms involved in the 
critical pair, and only remember: (1) the ID numbers of the two rules involved, 
(2) the position of the overlap, and (3) the score of the critical pair. When a 
critical pair is selected, the ID numbers and position are used to reconstruct 
the critical pair. This design uses about 12 bytes of memory per critical pair, so 
Twee can run for many hours without running out of memory. 
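
A rough sketch of such a compact representation is given below. The exact field widths used by Twee are not stated above, so the layout here (two 32-bit rule IDs, a 16-bit overlap position and a 16-bit score, 12 bytes in total) is an assumption, as are the class name `PassiveSet`, the linear-scan selection, and the convention that a lower score is better.

```python
import struct

# Assumed layout: rule ID, rule ID, overlap position, score = 4 + 4 + 2 + 2 bytes.
ENTRY = struct.Struct("<IIHH")

class PassiveSet:
    """Critical pairs stored as fixed-size records in one flat byte buffer."""

    def __init__(self):
        self.buf = bytearray()

    def __len__(self):
        return len(self.buf) // ENTRY.size

    def add(self, rule1, rule2, position, score):
        self.buf += ENTRY.pack(rule1, rule2, position, score)

    def pop_best(self):
        """Remove and return the record with the lowest score (linear scan)."""
        best = min(range(len(self)),
                   key=lambda i: ENTRY.unpack_from(self.buf, i * ENTRY.size)[3])
        offset = best * ENTRY.size
        entry = ENTRY.unpack_from(self.buf, offset)
        del self.buf[offset:offset + ENTRY.size]
        return entry

passive = PassiveSet()
passive.add(1, 2, 0, 30)
passive.add(1, 3, 4, 10)
passive.add(2, 3, 1, 20)
print(ENTRY.size, len(passive.buf))   # bytes per record, total bytes
print(passive.pop_best())             # rule IDs and position to rebuild the pair from
```

The selected record carries no terms; the prover would have to recompute the overlap of the two rules at the given position to recover the actual critical pair.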


5 Evaluation 


In this section we report on two evaluations: one investigating the effect of the 
different redundancy criteria described earlier, and one comparing the performance 
of Twee against E 2.5 and Waldmeister. In both cases we ran Twee on all 981 
unsatisfiable UEQ problems from TPTP 7.4.0, with a time limit of 5 minutes. 


Redundancy criteria. Figure 1a shows how the performance of Twee varies 
depending on which redundancy criteria are enabled. The x-axis shows the 
number of problems solved (starting from problem 600) and the y-axis shows 
the runtime for that problem. The combination of ground joinability testing and 
connectedness is much stronger than either on its own; it seems that each 
catches cases that the other misses. It is clearly best to have both switched on. 

The figure also includes a variant of Twee which implements the heuristic 
for AC functions described in [1] (and no other redundancy criterion), which 
solves fewer problems than our approach. This is perhaps not surprising, as our 
approach handles a wider class of functions. 


Twee, E, Waldmeister. Figure 1b compares Twee’s performance against E and 
Waldmeister. Twee is run in three variations: with and without the goal-directed 
transformation described earlier, and as a timesliced version which runs the other 
two versions for 150s each. By far the best choice for Twee is to timeslice, in which 
case it comes close to Waldmeister’s performance. This suggests that Twee with and 
without the goal transformation solve somewhat different sets of problems. 

[Figure 1, two plots.
(a) Different redundancy criteria. Series: no connectedness, no ground joining; connectedness only; ground joining only; connectedness + ground joining; Waldmeister-style AC heuristic.
(b) Compared against Waldmeister and E. Series: Twee; Twee with goal transformation; Twee with timeslicing; Waldmeister; E.]

Fig. 1: Benchmarks. 


6 Future Work 


Knuth-Bendix completion pays little attention to the goal: it simply completes 
the rewrite system until the goal becomes trivial. We plan to search for ways 
to make Twee more goal-directed, for example by rewriting the goal backwards 
somewhat in the style of [18]. The success of the goal transformation shows that 
goal direction ought to be important. 

Twee uses a fixed term ordering, which is clearly a weakness on certain 
problem kinds such as RNG. We do not want to choose a term order based on 
syntactic analysis of the problem, but would like to choose it dynamically based 
on the state of the proof, perhaps by incorporating ideas from MædMax [19]. 


7 Conclusion 


Twee is a unit equality prover implemented in 5300 lines of Haskell code. Its 
performance is good, thanks to a careful implementation, strong redundancy 
criteria and a transformation to help goal-directness. It performs particularly 
strongly on problems involving permutative laws, such as those in LAT and REL. 
Its main weaknesses are that it always uses a fixed term order, and has only weak 
goal direction. We hope that a future version of Twee, with real goal direction 
and a smart choice of term order, will be even stronger. 


Abstract. Naproche is an emerging natural proof assistant that accepts 
input in the controlled natural language ForTheL. Naproche is included 
in the current version of the Isabelle/PIDE which allows comfortable 
editing and asynchronous proof-checking of ForTheL texts. The .tex 
dialect of ForTheL can be typeset by LaTeX into documents that approximate 
the language and appearance of ordinary mathematical texts. 


1 Introduction 


Naproche (for Natural Proof Checking) is an emerging natural proof assistant 
that accepts input in a controlled natural language, approximating ordinary 
mathematical language and texts. The system uses 


— the dedicated input language ForTheL (Formula Theory Language), 

— natural language processing for texts with symbolic material, 

— strong automatic theorem proving (ATP) for filling in implicit or obvious 
proof steps. 


The current version of Naproche also introduces a LaTeX dialect of ForTheL so 
that high-quality mathematical typesetting is readily available. Naproche allows 
the formalization and proof-checking of advanced mathematics in a style that is 
immediately readable by mathematicians. Example formalizations from various 
domains of undergraduate mathematics are included. 

Naproche ships as a component in the latest release of the Isabelle prover 
platform [8]. When editing a ForTheL file in Isabelle/jEdit Prover IDE (PIDE), 
there is an auxiliary Naproche server in the background to quickly answer re- 
quests for checking ForTheL texts, with an internal cache to avoid repeated 
checking of unchanged text segments. The implementation uses programming 
interfaces of Isabelle/PIDE that allow user-defined file formats to participate in the 
concurrent document model. A second auxiliary server allows the Naproche 
program to run external prover processes under the control of Isabelle, with explicit 
timeouts. This works reliably on the usual platforms (Linux, Windows, macOS) 
by re-using external provers of Isabelle/Sledgehammer [17]. From the perspective 
of logic, there is no connection between Naproche and Isabelle/Sledgehammer or 
any other Isabelle/HOL tools. 

In this paper we briefly discuss the need for natural proof assistants, provide 
some general information on Isabelle/Naproche, and give an overview of meth- 
ods employed in the system, using an excerpt from a formalization of Euclid’s 
infinitude of primes as a running example. To conclude we compare Naproche to 
other projects in formal mathematics with natural language input and indicate 
ways to further extend Naproche’s naturalness and efficiency. 


2 Natural Proof Assistants 


While state-of-the-art interactive theorem provers have been successfully used to 
prove and certify highly non-trivial research mathematics, they are still, according 
to Lawrence Paulson, “unsuitable for mathematics. Their formal proofs 
are unreadable.” 

Natural proof assistants intend to bridge the wide gap between intuitive 
mathematical texts and the formal rigour of logical calculi. We propose the 
following criteria for natural proof assistants: 


— Input languages should be close to the mathematical vernacular, includ- 
ing support for common grammatical conventions and symbolic expressions. 
These languages should support familiar text structurings, such as the usual 
definition-theorem-proof style. 

— Proofs should consist of natural argumentative phrases for various proof 
tactics, allowing for a more declarative style. 

— The system should use familiar logics and mathematical ontologies. 

— Tedious details and obvious proof gaps should be filled in automatically. 

— An intuitive editor should allow for interactive text and theory development, 
where incremental proof checking can guide the formalization. 


We expect that naturalness will be crucial for the adoption of formal mathem- 
atics by the wider mathematical community. This is in line with some ongoing 
large-scale projects in formal mathematics. For instance, the ALEXANDRIA 
project by Paulson [16] stipulates: 


ALEXANDRIA will be based on legible structured proofs. Formal proofs 
should be not mere code, but a machine-checkable form of communication 
between mathematicians. 


The Formal Abstracts project of Thomas Hales [5] intends to 


— give a statement of the main theorem of each published mathematical paper 
in a language that is both human and machine readable, 

— link each term in theorem statements to a precise definition of that term 
(again in human/machine readable form). 


3 Isabelle/Naproche 


The Naproche proof assistant stems from two long-term efforts aiming towards 
naturalness: the Evidence Algorithm (EA) and System for Automated Deduction 
(SAD) projects at the universities of Kiev and Paris [14,15,20,21], and 
the Naproche project at Bonn [PBO]. Naproche extends the input language 
ForTheL of SAD and embeds it into LaTeX, allowing mathematical typesetting; 
the original proof-checking mechanisms of SAD have been made more efficient 
and varied. 

The first experimental integration of the then Naproche-SAD prover into 
the Isabelle Prover IDE was done in 2018 by Frerix and Wenzel [23, §1.2]. The 
current (refined and extended) version has now become a bundled component 
of Isabelle2021 [8]. After downloading and unpacking the Isabelle distribution, 
Isabelle/Naproche becomes immediately accessible in the Documentation panel, 
section Examples, entry $ISABELLE_NAPROCHE/Intro.thy. Isabelle and its add- 
on components work directly without manual installation, but this comes at 
the cost of substantial resource requirements: on Linux the total size is 1.2 GB, 
which includes Java 15 (330 MB), E prover 2.5 (30 MB), and Naproche (20 MB). 
The bulk of other Isabelle components are required for Isabelle/HOL theory and 
proof development, but Naproche has no logical connection to that. 

The Naproche prover is invoked automatically when editing ForTheL files 
with .ftl or .ftl.tex extensions. Further examples and an introductory tu- 
torial are linked in the Isabelle theory file $ISABELLE_NAPROCHE/Intro.thy: as 
usual for Isabelle/jEdit and other IDEs, following a link works by a mouse click 
combined with the keyboard modifier CTRL (Linux, Windows) or CMD (macOS). 
The examples deal with results from undergraduate number theory, geometry, 
and set theory; most are available in the classic ASCII style as well as in LaTeX 
style and typeset in PDF. 

The ForTheL library FLib [13] contains a variety of formalizations for earlier 
versions of Naproche. Some substantial texts have been written as undergraduate 
student projects and cover, e.g., group theory up to Sylow theorems, initial 
chapters from Walter Rudin’s Analysis, or set theory up to Silver’s theorem 
in cardinal arithmetic. These texts will soon be upgraded to the new version 
of Naproche and included in an interlinked formalized library of readable and 
proof-checked mathematical texts. 


4 Formalizing in ForTheL 


4.1 Example 


The following screenshot shows a proof of the infinitude of prime numbers in the 
Isabelle/Naproche Prover IDE taken from the bundled tutorial which itself is a 
proof-checked ForTheL text: 

[Screenshot of the Isabelle/Naproche Prover IDE omitted.]

The editor buffer contains the ForTheL source, which also happens to conform 
to standard LaTeX format. (The “Contradiction” lemma, now deactivated by a %, 
is a left-over of a typical check for hidden inconsistencies in the axiomatic setup.) 
The Output panel contains feedback from the Naproche prover about the source 
document: “verification successful” and some statistics; the most relevant messages 
are also shown in-line over the source as squiggly underlines with popups on 
mouse-hovering. The Sidekick/latex structure overview is provided by standard 
plugins of the underlying text editor. This piece of mathematics is typeset by 
LaTeX as follows: 


Euclid’s Theorem 


Signature. P is the class of prime natural numbers. 

Theorem. P is infinite. 

Proof. Assume that r is a natural number and p is a sequence of length r 
and {p_1, ..., p_r} is a subclass of P. [...] □ 


4.2 The ForTheL Language 


The mathematical controlled language ForTheL has been developed over several 
decades in the Evidence Algorithm (EA) / System for Automated Deduction 
(SAD) project. It is carefully designed to approximate the weakly typed nat- 
ural language of mathematics whilst being efficiently translatable to first-order 
logic. In ForTheL, standard mathematical types are called notions, and these 
are internally represented as predicates with a distinguished variable, which are 
treated as unary predicates with the other variables used as parameters (“types 
as predicates”). This leads to a flexible dependent type system where number 
systems can be cumulative (ℕ ⊆ ℝ), and notions can depend on parameters 
(subsets of ℕ, divisors of n). 

First-order languages of notions, constants, relations, and functions can be 
introduced and extended by signature and definition commands. The formalization 
of Euclid’s theorem, e.g., starts as follows: 


Signature. A natural number is a small object. 
Let ...m,n... denote natural numbers. 
Signature. 0 is a natural number. 


Signature. m + n is a natural number. 


5 Architecture of the Naproche System 


Naproche follows standard principles of interactive theorem proving, but with 
a strong emphasis on the naturalness aspects explained above. The general in- 
formation processing in the system is described in the following diagram. The 
core Naproche program is implemented in Haskell. 


[Diagram: ForTheL (ASCII) and ForTheL (LaTeX) input
  -> Text processing (tokenizing, parsing, translation)
  -> annotated tree of first-order statements
  -> Ontological Checking (identifies ontological requirements and adds this
     information to the document)
  -> further annotations
  -> Logical Checking (applies basic proof tactics and prepares tasks for the
     external prover)
  -> TPTP -> E, Vampire, etc. -> yes/no.
Messages / Logs: errors, warnings, successes, etc.]


In the sequel we shall describe the main components of Naproche. 


5.1 Tokenizing and Parsing 


Naproche uses a standard tokenizing algorithm for cutting text up into a list of 
meaningful tokens, with precise source positions to enable PIDE messages and 
markup, e.g., by colours for free and bound variables. When using LaTeX syntax, 
the tokenizer also takes care of expanding certain TeX commands (see the next 
subsection). 
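
As an illustration of tokenizing with source positions, consider the following Python sketch. It is not Naproche's Haskell tokenizer: the token classes, the `Token` record and its fields are assumptions made for this example. The point is only that each token carries enough position information for an IDE to attach messages and markup to the right span of text.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    text: str
    line: int      # 1-based line number
    column: int    # 1-based column within the line
    offset: int    # 0-based character offset in the whole source

# Assumed token classes: words, numbers, TeX-style commands, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|\\[A-Za-z]+|\S")

def tokenize(source: str) -> List[Token]:
    tokens = []
    line, line_start, scanned = 1, 0, 0
    for m in TOKEN_RE.finditer(source):
        # Account for any newlines between the previous token and this one.
        nl = source.find("\n", scanned, m.start())
        while nl != -1:
            line += 1
            line_start = nl + 1
            nl = source.find("\n", nl + 1, m.start())
        scanned = m.start()
        tokens.append(Token(m.group(), line, m.start() - line_start + 1, m.start()))
    return tokens

for tok in tokenize("Theorem. P is\ninfinite."):
    print(tok)
```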

Parsing is carried out in Haskell’s monadic style with parser combinators. 
We allow ambiguous parsing, since it better fits natural language. Currently the 
translation into tagged first-order logic is already part of the parsing process. The 
following translation of our example snippet was obtained by running Naproche 
from the command line with the -T (translate) option: 
hypothesis. 
assume forall v0 ((HeadTerm :: v0 = Primes) implies 
  (aClass(v0) and forall v1 (aElementOf(v1,v0) 
  iff (aNaturalNumber(v1) and isPrime(v1))))). 


conjecture Euclid. 
isInfinite(Primes). 
proof. 
assume ((aNaturalNumber(r) and aSequenceOfLength(p,r)) and 
  aSubsetOf(Set{p}{1}{r},Primes)). 
n = Prod{p}{1}{r}+1. 


In order to make Naproche more versatile we plan on parsing into an abstract 
syntax tree instead, so that different logical back-ends could translate into dif- 
ferent logics. We have already made some experiments on translating ForTheL 
to Lean [12]. 

Moreover, with the input language growing, we shall eventually turn to some 
grammatical framework to speed up language development without hard-coding 
vocabulary or grammar rules into the Naproche code. 


5.2 LaTeX Processing 


We have extended Naproche to support a .ftl.tex format, in addition to the 
original .ftl format. Files in .ftl.tex format are intended to be readable by 
both Naproche for logical checking and by LaTeX for typesetting. 

The LaTeX tokenizer ignores the whole document, except what is inside 
forthel environments of the form 


\begin{forthel} 
% Insert what you want Naproche to process here 
\end{forthel} 


In a forthel environment, standard LaTeX syntax can be used for declaring text 
environments for theorems and definitions. 
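
The effect of ignoring everything outside forthel environments can be sketched as follows. This is a simplified Python illustration, not the Naproche tokenizer: it uses a regular expression, does not handle TeX comments or nested environments, and the function name `forthel_blocks` is made up. Offsets are kept so that positions inside a block can still be reported relative to the whole file.

```python
import re

# Non-greedy match of everything between \begin{forthel} and \end{forthel}.
FORTHEL_ENV = re.compile(r"\\begin\{forthel\}(.*?)\\end\{forthel\}", re.DOTALL)

def forthel_blocks(tex: str):
    """Return (offset, text) for the body of every forthel environment."""
    return [(m.start(1), m.group(1)) for m in FORTHEL_ENV.finditer(tex)]

DOC = r"""\section{Introduction}
This prose is seen by LaTeX only.
\begin{forthel}
Theorem. P is infinite.
\end{forthel}
More prose.
\begin{forthel}
Signature. 0 is a natural number.
\end{forthel}
"""

for offset, body in forthel_blocks(DOC):
    print(offset, body.strip())
```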

In Naproche, users can define their own operators and phrases by defining linguistic 
and symbolic patterns. This mechanism has been adapted to allow LaTeX 
constructs in patterns. In the Euclid text we use the pattern \Set{p}{1}{r} for 
the finite set {p_1, ..., p_r}. By defining \Set as a LaTeX macro we can arrange 
that the ForTheL pattern will be printed in the familiar set notation: 


\newcommand{\Set}[3]{\{#1_{#2},\dots,#1_{#3}\}} 
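
To see what this macro does to the pattern, the following Python sketch mimics the parameter substitution that LaTeX performs for \Set. It is only an illustration of the expansion, not part of Naproche or of LaTeX; the helper names are invented, and arguments containing nested braces are not handled.

```python
import re

# Body of the macro from the \newcommand line above.
SET_BODY = r"\{#1_{#2},\dots,#1_{#3}\}"

# A call \Set{..}{..}{..} with three brace-free arguments.
SET_CALL = re.compile(r"\\Set\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}")

def substitute(body: str, args) -> str:
    """Replace #1, #2, ... in a macro body by the corresponding argument."""
    return re.sub(r"#(\d)", lambda m: args[int(m.group(1)) - 1], body)

def expand_set(source: str) -> str:
    """Expand every \\Set{a}{b}{c} call in source."""
    return SET_CALL.sub(lambda m: substitute(SET_BODY, m.groups()), source)

print(expand_set(r"$\Set{p}{1}{r}$ is a subclass of $P$."))
```

Naproche reads the unexpanded pattern \Set{p}{1}{r}, while LaTeX typesets the expanded form, so the same source serves both purposes.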


There are some primitive concepts in Naproche, such as the logical operators ∨, 
∧, ∃ that are directly recognized in the LaTeX source and expanded to corresponding 
internal tokens. 
The current release of Naproche does not differentiate between math mode 
and text mode in LaTeX, since it re-uses much of the parsing machinery of the 
original .ftl format. Future releases shall make such a distinction to increase the 
robustness of the parser, improve error messages and resolve some ambiguities 
in the current grammar. 


5.3 Logical Processing 


The first-order formulas derived from ForTheL statements are put into an in- 
ternal ProofText data type consisting of blocks of formulae, arranged in a tree- 
like fashion. The tree structure mirrors the logical structure of a text, where a 
statement can be seen as a node to which a subtext, e.g., its proof is attached. 
Since statements in a proof can have their own subproofs this leads to a recurs- 
ive tree structure, on which the further checking is performed along a depth-first 
left-to-right traversal. 
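
A tree of this shape and its traversal can be sketched in a few lines of Python. The actual ProofText type is a Haskell data type whose constructors are not shown here, so the `Block` record, its fields and the `kind` tags below are assumptions for illustration; visiting a statement before its subproof is likewise an assumption about the traversal order.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

@dataclass
class Block:
    formula: str
    kind: str = "assumption"             # e.g. "hypothesis", "conjecture"
    proof: List["Block"] = field(default_factory=list)   # attached subtext

def traverse(blocks: List[Block], depth: int = 0) -> Iterator[Tuple[int, Block]]:
    """Depth-first, left-to-right: a statement first, then its subproof."""
    for block in blocks:
        yield depth, block
        yield from traverse(block.proof, depth + 1)

text = [
    Block("aClass(Primes)", "hypothesis"),
    Block("isInfinite(Primes)", "conjecture", proof=[
        Block("aNaturalNumber(r)", "assumption"),
        Block("n = Prod{p}{1}{r}+1", "conjecture"),
    ]),
]

for depth, block in traverse(text):
    print("  " * depth + f"{block.kind}: {block.formula}")
```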


5.4 Ontological Checking by the Naproche Reasoner 


An innocent mathematical statement like a² + b² = c² contains a number of 
implicit proof tasks, even if the whole statement is not to be proved, but part of 
a definition or an assumption. One has to check that a, b,c are (numerical) terms 
to which the squaring operation can be applied, and that the resulting squares 
can be subjected to addition and equality. These checks are called “ontological”, 
and they roughly correspond to type checking in type-orientated systems. The 
situation here is however more complicated, as types (i.e. notions) and operations 
may involve first-order definitions with preconditions, which cannot be decided 
during the parsing process but only during proof-checking. So in the checking 
process each node of the aforementioned tree is first checked ontologically; if the 
node formula itself is marked as a conjecture, it is logically checked. 


5.5 Logical Checking by the Naproche Reasoner 


The various checks are organized by the Naproche reasoner module. In simple 
cases the reasoner itself can supply a proof; if not, the reasoner constructs proof 
tasks for the ATP. Since definitions in first-order logic are formally symmetric 
equivalences, they may lead to circularities in proof searches. Instead definitions 
are successively unfolded by replacing the definiendum by the definiens. This 
process may be iterated when proof attempts fail. 

The ATP is given certain timeouts to search for proofs. Ontological checking 
is supposed to be easier than proper mathematical proving. So the default time 
for each ontological check is set to 1 sec, whereas proving gets 3 sec and can be 
iterated for several rounds of definition unfolding. 
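
The interplay of timeouts and definition unfolding can be sketched as a small loop. Everything except the two default timeouts is hypothetical here: the `prover` callback stands in for the external ATP, formulas are plain strings, unfolding is naive textual replacement, and the bound on the number of rounds is an invented parameter.

```python
from typing import Callable, Dict, List

ONTOLOGICAL_TIMEOUT = 1.0   # seconds, default stated above
PROOF_TIMEOUT = 3.0         # seconds per round, default stated above

Prover = Callable[[List[str], str, float], bool]

def unfold(goal: str, definitions: Dict[str, str]) -> str:
    """Replace the first definiendum found in the goal by its definiens."""
    for definiendum, definiens in definitions.items():
        if definiendum in goal:
            return goal.replace(definiendum, f"({definiens})", 1)
    return goal

def check(goal: str, premises: List[str], definitions: Dict[str, str],
          prover: Prover, timeout: float, max_rounds: int = 3) -> bool:
    """Try the prover; on failure unfold one definition and try again."""
    for round_no in range(max_rounds + 1):
        if prover(premises, goal, timeout):
            return True
        unfolded = unfold(goal, definitions)
        if round_no == max_rounds or unfolded == goal:
            return False
        goal = unfolded
    return False

calls = []
def stub_prover(premises, goal, timeout):
    # Toy stand-in for an ATP: "proves" a goal only if it is literally a premise.
    calls.append((goal, timeout))
    return goal in premises

ok = check("isInfinite(Primes)", ["(~isFinite(Primes))"],
           {"isInfinite(Primes)": "~isFinite(Primes)"},
           stub_prover, PROOF_TIMEOUT)
print(ok, calls)
```

An ontological check would call the same loop with ONTOLOGICAL_TIMEOUT instead of PROOF_TIMEOUT.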


5.6 Communication with an External ATP 


Proof tasks are translated into the generic TPTP first-order format for ATPs. 
These can be viewed in the Output window of Isabelle/jEdit, after inserting the 
directive [dump on] into the ForTheL source. The final proof task in checking 
Euclid’s proof ends with the TPTP lines: 


fof(m_,hypothesis,( ! [W0] : (aClass(W0) => 
    (isInfinite(W0) <=> ( ~ isFinite(W0)))))). 

fof(m_,hypothesis,(aClass(szPzrzizmzezs) & 
    ( ! [W0] : (aElementOf(W0,szPzrzizmzezs) 
    <=> (aNaturalNumber(W0) & isPrime(W0)))))). 

fof(m__,conjecture, 
    (aElementOf(W4,szSzeztlcdtrclczirclcdtrc(W0,W1)) <=> 
    (aNaturalNumber(W4) & isPrime(W4))))))))))))) => 
    isInfinite(szPzrzizmzezs))). 


By default Naproche uses E prover [19] as external ATP, but one may switch to 
other provers available in the Isabelle distribution. 
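
The general shape of such a proof task, a list of hypotheses followed by one conjecture, can be reproduced with a few lines of Python. This is only a sketch of the TPTP `fof(name, role, formula).` layout; the helper names are invented and it says nothing about how Naproche mangles ForTheL names into TPTP symbols.

```python
from typing import List

def fof(name: str, role: str, formula: str) -> str:
    """Render one annotated first-order formula in TPTP syntax."""
    return f"fof({name},{role},({formula}))."

def proof_task(hypotheses: List[str], conjecture: str) -> str:
    """All hypotheses first, then the single conjecture to be proved."""
    lines = [fof("m_", "hypothesis", h) for h in hypotheses]
    lines.append(fof("m__", "conjecture", conjecture))
    return "\n".join(lines)

task = proof_task(
    ["! [W0] : (aClass(W0) => (isInfinite(W0) <=> ( ~ isFinite(W0))))",
     "aClass(primes)",
     "~ isFinite(primes)"],
    "isInfinite(primes)",
)
print(task)
```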


6 Integration into Isabelle 


The initial integration of Naproche into the Isabelle Prover IDE happened in 
2018 and is briefly reported as an example in the PIDE overview article [23] 
based on Isabelle2019 (June 2019). The main idea was to turn the existing 
Haskell command-line program into a TCP server that can answer concurrent 
requests for checking ForTheL texts in a purely functional manner, with proper 
handling of cancel messages (for interrupts caused by user editing); this required 
removing a few low-level system operations, like reading physical files or exiting 
the process. Afterwards, the semantic operation forthel_file in Isabelle, which 
checks ForTheL text and produces markup messages according to the PIDE 
protocol, was implemented as an Isabelle/Isar command in Isabelle/ML as usual, 
but the main work is delegated to the Naproche server. Its implementation uses 
the Isabelle/Haskell library for common Isabelle/PIDE message formats, source 
positions, markup etc. — it is maintained within the Isabelle distribution. 

The current version of Isabelle/Naproche refines this approach in various 
respects. In particular, Isabelle2021 now provides a standard mechanism for 
user-defined Isabelle/Scala services: this is both relevant for Isabelle command-line 
tools to build and test Isabelle/Naproche, and the Prover IDE support of 
ForTheL files to connect the Isabelle/jEdit front-end to the Naproche back-end. 

Moreover, the Java process running the Prover IDE provides an additional 
TCP server to launch external provers that are already distributed with Isabelle 
(thanks to Isabelle/Sledgehammer): Naproche applications mainly use the current 
E prover 2.5 [19], but SPASS and Vampire are available for experiments. 
The existing management of processes in Isabelle/Scala involves considerable efforts 
to robustly support interrupts and timeouts in a concurrent environment; 
this works on all platforms supported by Isabelle (using special tricks for 
Windows/Cygwin, and macOS/Rosetta on Apple Silicon). 

The documentation file $ISABELLE_NAPROCHE/Intro.thy gives further hints 
on implementation near the end, with hyperlinks to the sources. A lot of technical 
Isabelle infrastructure is re-used by Isabelle/Naproche, but there is presently no 
connection to Isabelle/HOL, which is a much larger and better-known applica- 
tion of the same Isabelle framework [18]. 


7 Related and Future Work 


Bridging the gap between mathematical practice and fully formal methods has 
always been a central concern in formal mathematics. The development of the 
Mizar system [11] was accompanied or even driven by the stepwise adaptation of 
its language to standard mathematical proof methods and logical foundations. 
In contrast, most interactive theorem provers feature formal tactic languages, 
with tactics scripts that can hardly be understood without stepwise tracing and 
reconstructing internal logical states. 

The Mizar language has been a role model for other proof languages. There 
are, e.g., "Mizar modes" for HOL and Coq [4] and the widely used Isar 
language for Isabelle [24,22]. These languages can be read by mathematicians, 
with some effort, but they retain a strong bias toward computer science customs. 
A survey of input languages for formalization on a scale between formal and 
natural can be found in [9]. 

Only a few formal mathematics projects have aimed at processing actual 
mathematical language. These projects have operated in isolation and seem to 
be mostly inactive now. The paper [7] by Muhammad Humayoun and Christophe 
Raffalli, e.g., describes the MathNat project and also surveys other related at- 
tempts. 

The Naproche approach can be viewed in the Mizar tradition: use a rich 
controlled language for mathematics, increase the proving capabilities by strong 
automated theorem proving, and, eventually, create an extensive library of basic 
mathematics and specialized theories, which simultaneously can be used as a 
library for human readers. 

The readability and naturalness of texts which proof-check in the Naproche 
system motivate significant further extensions of the project where ad hoc meth- 
ods are to be replaced by principled and established approaches: 

1. the input language ForTheL has to be extended for wide mathematical 
coverage; ForTheL needs an extensive formal grammar and vocabulary to be 
processed by strong linguistic methods; the vocabulary may also encompass 
standard TEX symbols and semantic information; 

2. methods of type derivation and elaboration should be provided; 

3. Isabelle/Sledgehammer-like methods should lead to efficient premise selection 
in large texts and theories; 

4. the creation of libraries of ForTheL documents requires import and export 
mechanisms corresponding to quoting and referencing in the mathematical 
literature; 

5. the natural text processing of Naproche should be interfaced with other 
proof assistants to leverage their strengths and libraries. We shall in particular 
work on a “Naproche mode” for Isabelle. 


Abstract. Lean 4 is a reimplementation of the Lean interactive theo- 
rem prover (ITP) in Lean itself. It addresses many shortcomings of the 
previous versions and contains many new features. Lean 4 is fully extensi- 
ble: users can modify and extend the parser, elaborator, tactics, decision 
procedures, pretty printer, and code generator. The new system has a hy- 
gienic macro system custom-built for ITPs. It contains a new typeclass 
resolution procedure based on tabled resolution, addressing significant 
performance problems reported by the growing user base. Lean 4 is also 
an efficient functional programming language based on a novel program- 
ming paradigm called functional but in-place. Efficient code generation 
is crucial for Lean users because many write custom proof automation 
procedures in Lean itself. 


1 Introduction 


The Lean project³ started in 2013 [9] as an interactive theorem prover based on 
the Calculus of Inductive Constructions [4] (CIC). In 2017, using Lean 3, a com- 
munity of users with very different backgrounds started the Lean mathematical 
library project mathlib [13]. At the time of this writing, mathlib has roughly half 
a million lines of code, and contains many nontrivial mathematical objects such 
as Schemes [2]. Mathlib is also the foundation for the Perfectoid Spaces in Lean 
project [1], and the Liquid Tensor challenge posed by the renowned mathematician 
Peter Scholze. Mathlib contains not only mathematical objects but also 
Lean metaprograms that extend the system [5]. Some of these metaprograms 
implement nontrivial proof automation, such as a ring theory solver and a de- 
cision procedure for Presburger arithmetic. Lean metaprograms in mathlib also 
extend the system by adding new top-level commands and features not related 
to proof automation. For example, it contains a package of semantic linters that 
alert users to many commonly made mistakes [5]. Lean 3 metaprograms have 
also been instrumental in building standalone applications, such as a SQL query 
equivalence checker [3]. 


3 http://leanprover.github.io 

We believe the Lean 3 theorem prover’s success is primarily due to its exten- 
sibility capabilities and metaprogramming framework [6]. However, users cannot 
modify many parts of the system without changing Lean 3 source code written 
in C++. Another issue is that many proof automation metaprograms are not 
competitive with similar proof automation implemented in programming lan- 
guages with an efficient compiler such as C++ and OCaml. The primary source 
of inefficiency in Lean 3 metaprograms is the virtual machine interpretation 
overhead. 

Lean 4 is a reimplementation of the Lean theorem prover in Lean itself⁴. 
It is an extensible theorem prover and an efficient programming language. The 
new compiler produces C code, and users can now implement efficient proof au- 
tomation in Lean, compile it into efficient C code, and load it as a plugin. In 
Lean 4, users can access all internal data structures used to implement Lean 
by merely importing the Lean package. Lean 4 is also a platform for developing 
efficient domain-specific automation. It has a more robust and extensible elab- 
orator, and addresses many other shortcomings of Lean 3. We expect the Lean 
community to extend and add new features without having to change the Lean 
source code. We released Lean 4 at the beginning of 2021; it is open source, the 
community is already porting mathlib, and the number of applications is quickly 
growing. These include a translation verifier for Reopt⁵, a package for supporting 
inductive-inductive types⁶, and a car controller⁷. 


2 Lean by Example 


In this section, we introduce the Lean language using a series of examples. The 
source code for the examples is available at 
https://github.com/leanprover/lean4/blob/cade2021/doc/BoolExpr.lean. For additional 
details and installation instructions, we recommend the reader consult the online 
manual⁸. 

We define functions by using the def keyword followed by its name, a pa- 
rameter list, return type, and body. The parameter list consists of successive 
parameters that are separated by spaces. We can specify an explicit type for 
each parameter. If we do not specify a specific argument type, the elaborator 
tries to infer the function body’s type. The Boolean or function is defined by 
pattern-matching as follows 


def or (a b : Bool) := 
  match a with 
  | true  => true 
  | false => b 


4 http://github.com/leanprover/lean4 
5 https://github.com/GaloisInc/reopt-vcg 
6 https://github.com/javra/iit 
7 https://github.com/GaloisInc/lean4-balance-car 
8 http://leanprover.github.io/lean4/doc 

We can use the command #check <term> to inspect the type of term, and #eval 
<term> to evaluate it. 


#check or true false -- Bool (this is a comment in Lean) 
#eval or true false -- true 


Lean has a hygienic macro system and comes equipped with many macros for 
commonly used idioms. For example, we can also define the function or using 


def or : Bool → Bool → Bool
  | true, _ => true
  | false, b => b


The notation above is a macro that expands into a match-expression. In Lean,
a theorem is a definition whose result type is a proposition. As an example,
consider the following simple theorem about the definition above:


theorem or_true (b : Bool) : or true b = true :=
  rfl


The constant rfl has type ∀ {α : Sort u} {a : α}, a = a. The curly braces
indicate that the parameters α and a are implicit and should be inferred by
solving typing constraints. In the example above, the inferred values for α and a
are Bool and or true b, respectively, and the resulting type is or true b = or
true b. This is a valid proof because or true b is definitionally equal to true. In
dependent type theory, every term has a computational behavior, and supports
a notion of reduction. In principle, two terms that reduce to the same value are
called definitionally equal. In the following example, we use pattern matching to
prove that or b b = b:


theorem or_self : ∀ (b : Bool), or b b = b
  | true => rfl
  | false => rfl


Note that or b b does not reduce to b, but after pattern matching we have that 
or true true (or false false) reduces to true (false). 
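
The difference between a term that reduces on its own and one that only reduces after case analysis can be seen in a small self-contained sketch. The name myOr and both theorem names are hypothetical; we rename the function only to avoid a clash with the or that the core library already provides, and we assume a recent Lean 4 toolchain.

```lean
-- `myOr` is a hypothetical copy of the `or` function from the text.
def myOr (a b : Bool) : Bool :=
  match a with
  | true  => true
  | false => b

-- `myOr true b` reduces to `true` by computation alone, so `rfl` suffices.
theorem myOr_true_left (b : Bool) : myOr true b = true :=
  rfl

-- `myOr b true` is stuck while `b` is a variable; after case analysis each
-- goal reduces to `true = true`, which `rfl` closes.
theorem myOr_true_right (b : Bool) : myOr b true = true := by
  cases b <;> rfl
```

The second theorem uses tactic mode, but the idea is the same as in or_self: matching on b exposes a constructor, and reduction can then proceed.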

In the following example, we define the recursive datatype BoolExpr for rep- 
resenting Boolean expressions using the command inductive. 


inductive BoolExpr where
  | var (name : String)
  | val (b : Bool)
  | or (p q : BoolExpr)
  | not (p : BoolExpr)


This command generates constructors BoolExpr.var, BoolExpr.val, BoolExpr.or,
and BoolExpr.not. The Lean kernel also generates an induction principle for the
new type BoolExpr. We can write a basic “simplifier” for Boolean expressions as
follows:


def simplify : BoolExpr → BoolExpr
  | BoolExpr.or p q => mkOr (simplify p) (simplify q)
  | BoolExpr.not p => mkNot (simplify p)
  | e => e
where
  mkOr : BoolExpr → BoolExpr → BoolExpr
    | p, BoolExpr.val true => BoolExpr.val true
    | p, BoolExpr.val false => p
    | BoolExpr.val true, p => BoolExpr.val true
    | BoolExpr.val false, p => p
    | p, q => BoolExpr.or p q

  mkNot : BoolExpr → BoolExpr
    | BoolExpr.val b => BoolExpr.val (not b)
    | p => BoolExpr.not p


The function simplify is a simple bottom-up simplifier. We use the where clause 
to define two local auxiliary functions mkOr and mkNot for constructing “simplified” 
or and not expressions respectively. Their global names are simplify.mkOr and 
simplify.mkNot. 

Given a context that maps variable names to Boolean values, we define a “de- 
notation” function (or evaluator) for Boolean expressions. We use an association 
list to represent the context. 


abbrev Context := AssocList String Bool

def denote (ctx : Context) : BoolExpr → Bool
  | BoolExpr.or p q => denote ctx p || denote ctx q
  | BoolExpr.not p => !denote ctx p
  | BoolExpr.val b => b
  | BoolExpr.var x => if let some b := ctx.find? x then b else false

In the example above, p || q is notation for or p q, !p for not p, and if let
p := t then a else b is a macro that expands into match t with | p => a | _
=> b. The term ctx.find? x is syntax sugar for AssocList.find? ctx x.
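
To experiment with the evaluator without depending on where AssocList lives in a given release, one can substitute a plain list of pairs for the context. The following is a minimal sketch under that assumption; the helper lookup is hypothetical and is not part of the paper's development.

```lean
inductive BoolExpr where
  | var (name : String)
  | val (b : Bool)
  | or (p q : BoolExpr)
  | not (p : BoolExpr)

-- Hypothetical stand-in for the association list: a plain list of pairs.
def lookup : List (String × Bool) → String → Option Bool
  | [], _ => none
  | (k, v) :: rest, x => if k == x then some v else lookup rest x

-- Same shape as `denote` in the text; unbound variables evaluate to `false`.
def denote (ctx : List (String × Bool)) : BoolExpr → Bool
  | BoolExpr.or p q => denote ctx p || denote ctx q
  | BoolExpr.not p => !denote ctx p
  | BoolExpr.val b => b
  | BoolExpr.var x => if let some b := lookup ctx x then b else false

#eval denote [("a", false), ("b", true)]
  (BoolExpr.or (BoolExpr.var "b") (BoolExpr.var "a"))  -- true
#eval denote [] (BoolExpr.not (BoolExpr.var "a"))       -- true
```

The second evaluation shows the effect of the else branch: the variable a is unbound, so it denotes false and its negation is true.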

As in previous versions, we can use tactics for constructing proofs and terms. 
We use the keyword by to switch into tactic mode. Tactics are user-defined or 
built-in procedures that construct various terms. They are all implemented in 
Lean itself. The simp tactic implements an extensible simplifier, and is one of 
the most popular tactics in mathlib. Its implementation can be extended and
modified by Lean users. 


@[simp] theorem denote_mkOr (ctx : Context) (p q : BoolExpr)
  : denote ctx (simplify.mkOr p q) = denote ctx (or p q) :=
  ...

def denote_simplify (ctx : Context) (p : BoolExpr)
  : denote ctx (simplify p) = denote ctx p :=
  by induction p with
  | or p q ih₁ ih₂ => simp [ih₁, ih₂]
  | not p ih => simp [ih]
  | _ => rfl

9 https://github.com/leanprover/lean4/blob/cade21/src/Lean/Meta/Tactic/Simp/Main.lean


In the example above, we use the induction tactic, whose syntax is similar to a
match-expression. The variables ih₁ and ih₂ are the induction hypotheses for p
and q in the first alternative, which handles the case where p is a BoolExpr.or.
The simp tactic uses any theorem marked with the @[simp] attribute as a rewriting
rule (e.g., denote_mkOr). We explicitly provide the induction hypotheses as
additional rewriting rules inside square brackets.
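
The same pattern of structured induction followed by simp with the induction hypotheses can be tried on a smaller, self-contained example. The type Tree and the function Tree.mirror below are hypothetical and chosen only for illustration; we assume a recent Lean 4 toolchain.

```lean
-- A hypothetical binary tree without payload.
inductive Tree where
  | leaf
  | node (l r : Tree)

-- Swap the children of every node.
def Tree.mirror : Tree → Tree
  | leaf => leaf
  | node l r => node r.mirror l.mirror

-- Mirroring twice gives back the original tree.
theorem Tree.mirror_mirror (t : Tree) : t.mirror.mirror = t := by
  induction t with
  | leaf => rfl
  | node l r ihl ihr =>
    -- unfold `mirror` on the constructor, then rewrite with both hypotheses
    simp [Tree.mirror, ihl, ihr]
```

As in denote_simplify, each alternative names the constructor arguments first and the induction hypotheses last, and the hypotheses are handed to simp as extra rewriting rules.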


Typeclass Resolution. Typeclasses [16] provide an elegant and effective way of
managing ad-hoc polymorphism in both programming languages and interactive
proof assistants. In Lean, any family of types can be marked as a typeclass, and
we can then declare particular elements of a typeclass to be instances. These
provide hints to the elaborator: any time the elaborator is looking for an
element of a typeclass, it can consult a table of declared instances to find a
suitable element. What makes typeclass inference powerful is that one can chain
instances, that is, an instance declaration can in turn depend on other
instances. This causes class inference to recurse through instances, backtracking
when necessary. The Lean typeclass resolution procedure can be viewed as a
simple λ-Prolog interpreter [8], where the Horn clauses are the user-declared
instances.

For example, the standard library defines a typeclass Inhabited to enable 
typeclass inference to infer a “default” or “arbitrary” element of types that contain 
at least one element. 


class Inhabited (α : Sort u) where
  default : α


def arbitrary [Inhabited α] : α :=
  Inhabited.default


The annotation [Inhabited α] at arbitrary indicates that this implicit parameter
should be synthesized from instance declarations using typeclass resolution.
We can define an instance for our BoolExpr type defined earlier as follows:


instance : Inhabited BoolExpr where
  default := BoolExpr.val false


This instance specifies that the “default” element for BoolExpr is BoolExpr.val
false. The following declaration shows that if two types α and β are inhabited,
then so is their product:


instance [Inhabited α] [Inhabited β] : Inhabited (α × β) where
  default := (arbitrary, arbitrary)
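
Instance chaining can be observed directly with a small stand-alone class. The class HasDefault and its field dflt are hypothetical names, used here so that the sketch does not interfere with the built-in Inhabited class.

```lean
-- Hypothetical class mirroring `Inhabited`, restricted to `Type` for simplicity.
class HasDefault (α : Type) where
  dflt : α

instance : HasDefault Bool where
  dflt := false

instance : HasDefault Nat where
  dflt := 0

-- A product has a default whenever both components do.
instance {α β : Type} [HasDefault α] [HasDefault β] : HasDefault (α × β) where
  dflt := (HasDefault.dflt, HasDefault.dflt)

-- Resolution applies the product instance twice, then the two base instances.
#eval (HasDefault.dflt : (Bool × Nat) × Bool)

example : (HasDefault.dflt : (Bool × Nat) × Bool) = ((false, 0), false) := rfl
```

No instance for the nested product is declared explicitly; it is assembled by the resolution procedure from the three declarations above.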


The standard library has many builtin classes such as Repr α and DecidableEq
α. The class Repr α is similar to Haskell’s Show α typeclass, and DecidableEq α
is a typeclass for types that have decidable equality. Lean 4 also provides code
synthesizers for many builtin classes. The command deriving instructs Lean to
auto-generate an instance.


deriving instance DecidableEq for BoolExpr

#eval decide (BoolExpr.val true = BoolExpr.val false) -- false


In the example above, the deriving command generates the instance

(a b : BoolExpr) → Decidable (a = b)

The function decide evaluates decidable propositions. Thus, the last command
returns false since BoolExpr.val true is not equal to BoolExpr.val false.
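
A derived DecidableEq instance is also what allows an equation to appear as the condition of an if-expression. The sketch below uses a hypothetical enumeration Color and the deriving clause attached directly to the inductive declaration, which we assume behaves like the stand-alone deriving instance command for this purpose.

```lean
-- Hypothetical enumeration; both instances are generated automatically.
inductive Color where
  | red | green | blue
  deriving DecidableEq, Repr

#eval decide (Color.red = Color.green)                        -- false
#eval if Color.red = Color.red then "same" else "different"   -- "same"
```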

The increasingly sophisticated uses of typeclasses in mathlib have exposed a 
few limitations in Lean 3: unnecessary overhead due to the lack of term indexing 
techniques, and exponential running times in the presence of diamonds. Lean 4 
implements a new procedure [12], tabled typeclass resolution, that solves these 
problems by using discrimination trees for better indexing and tabling, which
is a generalization of memoizing introduced initially to address similar limitations
of early logic programming systems.


The hygienic macro system. In interactive theorem provers (ITPs), Lean in- 
cluded, extensible syntax is not only crucial to lower the cognitive burden of 
manipulating complex mathematical objects, but plays a critical role in developing
reusable abstractions in libraries. Lean 3 supports such extensions in the
form of restrictive “syntax sugar” substitutions and other ad hoc mechanisms,
which are too rudimentary to support many desirable abstractions. As a result,
libraries are littered with unnecessary redundancy. The Lean 3 tactic language
is plagued by a seemingly unrelated issue: accidental name capture, which often
produces unexpected and counterintuitive behavior. Lean 4 takes ideas from the 
Scheme family of programming languages and solves these two problems simul- 
taneously by use of a hygienic, i.e. capture-avoiding, macro system custom-built 
for ITPs [5]. 

Lean 3’s “mixfix” notation system is still supported in Lean 4, but based 
on the much more general macro system; in fact, the Lean 3 notation keyword 
itself has been reimplemented as a macro, more specifically as a macro-generating 
macro. By providing such a tower of abstractions for writing syntax sugars, of 
which we will see more levels below, we want to enable users to work in the 
simplest model appropriate for their respective use case while always keeping 
open the option to switch to a lower, more expressive level. 

10 https://github.com/leanprover/lean4/blob/cade21/src/Lean/Meta/

As an example, we define the infix notation Γ ⊢ p, with precedence 50, for
the function denote defined earlier.


infix:50 "⊢" => denote

The infix command expands to

notation:50 Γ "⊢" p:50 => denote Γ p

which itself expands to the macro declaration

macro:50 Γ:term "⊢" p:term:50 : term => `(denote $Γ $p)


where the syntactic category (term) of placeholders and of the entire macro is 
now specified explicitly, implying that macros can also be written for/using other
categories such as the top-level command. The right-hand side uses an explicit 
syntax quasiquotation to construct the syntax tree, with syntax placeholders 
(antiquotations) prefixed with $. As suggested by the explicit use of quotations, 
the right-hand side may now be an arbitrary Lean term computing a syntax 
object, allowing for procedural macros as well. 

macro itself is another command-level macro that, for our notation example, 
expands to two commands 


syntax:50 term "⊢" term:50 : term
macro_rules
  | `($Γ ⊢ $p) => `(denote $Γ $p)


that is, a pair of parser extension and syntax transformer. By separating these 
two steps at this abstraction level, it becomes possible to define (mutually) re- 
cursive macros and to reuse syntax between macros. Using macro_rules, users 
can even extend existing macros with new rules. In general, separating parsing
and expansion means that we can obtain a well-structured syntax tree
pre-expansion, i.e. a concrete syntax tree, and use it to implement source code 
tooling such as auto-completion, go-to-definition, and refactorings. 
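
The separation into a parser extension and independently declared transformation rules can be sketched with a toy notation. The token bit! and the words yes and no are hypothetical and carry no meaning in Lean itself; the point is only that the second macro_rules command extends a syntax that an earlier command already gave rules for.

```lean
-- Step 1: a parser extension only; the new syntax has no meaning yet.
syntax "bit!" ident : term

-- Step 2: a first transformation rule for the new syntax.
macro_rules
  | `(bit! yes) => `(true)

-- Later, possibly in another file: a second rule for the same syntax.
macro_rules
  | `(bit! no) => `(false)

#eval bit! yes  -- true
#eval bit! no   -- false
```

If no rule matches a given use of the syntax, expansion fails with an error at that use site, which is how a partially specified notation reports unsupported input.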

We can use the syntax command for defining embedded domain-specific lan- 
guages. In simple cases, we can reuse existing syntactic categories for this but 
assign them new semantics, such as in the following notation for constructing 
BoolExpr objects. 


syntax "`[BExpr|" term "]" : term
macro_rules
  | `(`[BExpr| true]) => `(BoolExpr.val true)
  | `(`[BExpr| false]) => `(BoolExpr.val false)
  | `(`[BExpr| $x:ident]) => `(BoolExpr.var $(quote x.getId.toString))
  | `(`[BExpr| $p ∨ $q]) => `(BoolExpr.or `[BExpr| $p] `[BExpr| $q])
  | `(`[BExpr| ¬ $p]) => `(BoolExpr.not `[BExpr| $p])

#check `[BExpr| p ∨ true]
-- BoolExpr.or (BoolExpr.var "p") (BoolExpr.val true) : BoolExpr


The macro_rules command above specifies how to convert a subset of the builtin 
syntax for terms into constructor applications for BoolExpr. The term $(quote 
x.getId.toString) converts the identifier x into a string literal. 

As a final example, we modify the notation Γ ⊢ p. In the following version, Γ
is not an arbitrary term anymore, but a comma-separated sequence of entries of
the form var ↦ value, and the right-hand side is now interpreted as a BoolExpr
term by reusing our macro from above.


syntax entry := ident "↦ " term:max
syntax entry,* "⊢" term : term
macro_rules
  | `( $[$xs:ident ↦ $vs:term],* ⊢ $p:term ) =>
    let xs := xs.map fun x => quote x.getId.toString
    `(denote (List.toAssocList [$[( $xs , $vs )],*]) `[BExpr| $p])

#eval a ↦ false, b ↦ true ⊢ b ∨ a -- true


We use the antiquotation splice $[$xs:ident ↦ $vs:term],* to deconstruct the
sequence of entries into two arrays xs and vs containing the variable names and
values, respectively, adjust the former array, and combine them again in a second
splice.


3 The Code Generator 


The Lean 4 code generator produces efficient C code. It is useful for building 
both efficient Lean extensions and standalone applications. The code genera- 
tor performs many transformations, and many of them are based on techniques 
used in the Haskell compiler GHC [7]. However, in contrast to Haskell, Lean is a 
strict language. We control code inlining and specialization using the attributes 
@[inline] and @[specialize]. They are crucial for eliminating the overhead
introduced by the towers of abstractions used in our source code. Before emitting
C code, we erase proof terms and convert Lean expressions into an intermediate
representation (IR). The IR is a collection of Lean data structures, and users
can implement support for backends other than C by writing Lean programs
that import Lean.Compiler.IR. Lean 4 also comes with an interpreter for the IR,
which allows for rapid incremental development and testing right from inside the 
editor. Whenever the interpreter calls a function for which native, ahead-of-time 
compiled code is available, it will switch to that instead, which includes all func- 
tions from the standard library. Thus the interpretation overhead is negligible 
as long as e.g. all expensive tactics are precompiled. 


Functional but in-place. Most functional languages rely on garbage collection 
for automatic memory management. They usually eschew reference counting in 
favor of a tracing garbage collector, which has less bookkeeping overhead at run- 
time. On the other hand, having an exact reference count of each value enables 
optimizations such as destructive updates [14]. When performing functional up- 
dates, objects often die just before creating an object of the same kind. We 
observe a similar phenomenon when we insert a new element into a purely
functional data structure such as a binary tree, when a theorem prover rewrites
formulas, when a compiler applies optimizations by transforming abstract syntax
trees, or when we run the function simplify defined earlier. We call it the
resurrection hypothesis: many objects die just before creating an object of the
same kind. The Lean memory manager uses reference counting and takes advantage
of this hypothesis, and enables pure code to perform destructive updates in all
scenarios described above when objects are not shared. It also allows a novel
programming paradigm that we call functional but in-place (FBIP) [10]. Our
preliminary experimental results demonstrate that our new compiler produces
competitive code that often outperforms the code generated by high-performance
compilers such as ocamlopt and GHC [14]. As an example, consider the function
map f as that applies a function f to each element of a list as. In this example,
[] denotes the empty list, and a::as the list with head a followed by the tail as.

12 https://github.com/leanprover/lean4/blob/cade21/src/Lean/Compiler/IR/


def map : (α → β) → List α → List β
  | f, [] => []
  | f, a::as => f a :: map f as


If the list referenced by as is not shared, the code generated by our compiler does 
not allocate any memory. Moreover, if as is a nonshared list of lists of integers,
then map (map inc) as will not allocate any memory either. In contrast to 
static linearity systems, allocations are also avoided even if only a prefix of the list 
is not shared. FBIP also allows Lean users to use data structures, such as arrays 
and hashtables, in pure code without any performance penalty when they are not 
shared. We believe this is an attractive feature because hashtables are frequently 
used to implement decision procedures and nontrivial proof automation. 
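
The use of arrays in pure code can be sketched as follows. The function incrAll is a hypothetical example, not taken from the paper; it is written in the imperative-looking do-notation but is an ordinary pure function. Whether each update actually happens in place depends on the array being unshared at run time, which the code itself does not guarantee.

```lean
-- Add one to every element. Each `set!` can reuse the array's storage
-- when `ys` is the only reference to it; otherwise the array is copied.
def incrAll (xs : Array Nat) : Array Nat := Id.run do
  let mut ys := xs
  for i in [0:ys.size] do
    ys := ys.set! i (ys[i]! + 1)
  return ys

#eval incrAll #[1, 2, 3]  -- #[2, 3, 4]
```

From the caller's perspective the function is referentially transparent: a shared argument is left untouched, and only the cost of the call differs.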


4 The User Interface 


Our system implements the Language Server Protocol (LSP) using the task ab- 
straction provided by its standard library. The Lean 4 LSP server is incremental 
and is continuously analyzing the source text and providing semantic informa- 
tion to editors implementing LSP. Our LSP server implements most LSP features 
found in advanced IDEs, such as hyperlinks, syntax highlighting, type informa- 
tion, error handling, auto-completion, etc. Many editors implement LSP, but VS 
Code is the editor preferred by the Lean user community. We provide extensions
for visualizing the intermediate proof states in interactive tactic blocks, and we
want to port the Lean 3 widget library so that users can construct interactive
visualizations of their proofs and programs.


5 Conclusion 


Lean 4 aims to be a fully extensible interactive theorem prover and functional 
programming language. It has an expressive logical foundation for writing mathe- 
matical specifications and proofs and formally verified programs. Lean 4 provides 
many new unique features, including a hygienic macro system, an efficient
typeclass resolution procedure based on tabled resolution, an efficient code
generator, and abstractions for sealing low-level optimizations. The new
elaboration procedure is more general and efficient than those implemented in
previous versions. Users may also extend and modify the elaborator using Lean
itself. Lean has a relatively small trusted kernel, and the rich API allows users
to export their developments to other systems and implement their own reference
checkers. Lean is an ongoing and long-term effort, and future plans include
integration with external SMT solvers and first-order theorem provers, new
compiler backends, and porting the Lean 3 Mathematical Library.

Abstract. BELUGA is a proof checker that provides sophisticated infrastructure
for implementing formal systems with the logical framework
LF and proving metatheoretic properties as total, recursive functions 
transforming LF derivations. In this paper, we describe HARPOON, an 
interactive proof engine built on top of BELUGA. It allows users to de- 
velop proofs interactively using a small, fixed set of high-level actions 
that safely transform a subgoal. A sequence of actions elaborates into a 
(partial) proof script that serves as an intermediate representation de- 
scribing an assertion-level proof. Last, a proof script translates into a 
BELUGA program which can be type-checked independently. HARPOON 
is available on GitHub. We have used HARPOON to replay a wide array 
of examples covering all features supported by BELUGA. In particular, 
we have used it for normalization proofs, including the recently proposed 
POPLMark reloaded challenge. 


1 Introduction 


Mechanizing formal systems and proofs about them plays an important role in 
establishing trust in programming languages and verifying software systems in 
general. Key questions in this setting are how to represent variables, (simulta- 
neous) substitutions, assumptions, and derivations that depend on assumptions. 
Higher-order abstract syntax (HOAS) provides an elegant and unifying answer 
to these questions, relieving users from having to write boilerplate code. 

BELUGA is a proof checker with built-in support for HOAS encodings of formal
systems based on the logical framework LF [13]. Metatheoretic inductive
proofs are implemented as recursive, dependently-typed functions that manip- 
ulate and transform HOAS representations [21]4)25]. In this paper, we describe 
the interactive proof engine HARPOON which is built on top of BELUGA. A 
HARPOON user modularly and incrementally develops a metatheoretic proof by 
solving independent subgoals via a fixed set of high-level actions. An action elim- 
inates the subgoal on which it is executed, filling it with a proof that possibly 
contains new subgoals to be resolved. The actions we support are: introduction of 
assumptions, case-analysis, inductive reasoning, and both forward and backward 
reasoning styles. 


© The Author(s) 2021 
A. Platzer and G. Sutcliffe (Eds.): CADE 2021, LNAI 12699, pp. 636-648, 2021. 
https://doi.org/10.1007/978-3-030-79876-5_38
While our fixed set of actions is largely inspired by similar systems such
as Twelf [20,28,27] and Abella [11], HARPOON advances the state of the art in
interactively developing mechanized proofs about HOAS representations in two
ways: 1. We treat subgoals as first-class and characterize them using contextual
types that pair their goal types together with the contexts in which they are
meaningful; a contextual substitution property guarantees that each step of proof
development correctly refines the partial proof under construction [8]. 2. Rather
than simply record the sequence of actions given by the user, we elaborate this
sequence into an assertion-level proof [15], represented as what we call a proof
script. The proof script is what we record as output of an interactive session. It 
can be both typechecked directly and translated into a BELUGA program. 

We have used HARPOON (see https://beluga-lang.readthedocs.io/) on a wide
range of representative examples from the BELUGA library: normalization proofs
for the simply-typed lambda calculus [6], benchmarks for reasoning about binders
PDI], and the recent POPLMark Reloaded challenge [1]. These examples involve
numerous concerns that arise in proof development, and cover all the domain- 
specific abstractions that BELUGA provides. Our experience shows that HAR- 
POON lowers the entry barrier for users: they only need to understand how to 
represent formal systems and derivations using HOAS encodings and can then 
manipulate the HOAS representations directly via the high-level actions which 
correspond closely to how proofs are developed on paper. As such, we believe 
that HARPOON eases the task of proving metatheoretic statements. 


2 Proof Development in Harpoon 


We introduce the main features of HARPOON by interactively developing the 
proof of two lemmas that play a central role in the proof of weak normalization 
of the simply-typed lambda calculus. For a more detailed description, see [6]. 


2.1 Initial setup: encoding the language 


We begin by defining the simply-typed lambda-calculus in the logical framework 
LF [13] using an intrinsically typed encoding. In typical HOAS style, lambda
abstraction takes an LF function representing the abstraction of a term over a 
variable. There is no case for variables, as they are treated implicitly. We remind 
the reader that this is a weak, representational function space — there is no case 
analysis or recursion, so only genuine lambda terms can be represented. 


LF tp : type =
  | unit : tp
  | arr : tp → tp → tp;

LF tm : tp → type =
  | lam : (tm T1 → tm T2) → tm (arr T1 T2)
  | app : tm (arr T1 T2) → tm T1 → tm T2;


Free variables such as T1 and T2 are implicitly universally quantified (see [23]) 
and programmers subsequently do not supply arguments for implicitly quantified 
parameters when using a constructor. 
Next, we define a small-step operational semantics for the language. For
simplicity, we use a call-by-name reduction strategy and do not reduce under 
lambda-abstractions. Note that we use LF application to encode the object-level 
substitution in the s_beta rule. 


LF step : tm T → tm T → type =
  | s_app : step M M' → step (app M N) (app M' N)
  | s_beta : step (app (lam M) N) (M N);

LF steps : tm T → tm T → type =
  | next : step M M' → steps M' N → steps M N
  | refl : steps M M;


Using this definition, we define a notion of termination: a term halts if it 
reduces to a value. This is captured by the constructor halts/m. 


LF val : tm T → type = v_lam : val (lam M);
LF halts : tm T → type = halts/m : val V → steps M V → halts M;


2.2 Termination Property: intros, split, unbox, and solve 


As the first short lemma, we show the Termination property: if M' is known to
halt and step M M', then M also halts. We start our interactive proof session by
loading the signature and defining the name of the theorem and the statement
that we want to prove.


Name of theorem: halts_step
Statement of theorem: [ ⊢ step M M'] → [ ⊢ halts M'] → [ ⊢ halts M]


We pair each LF object such as step M M' together with the LF context in
which it is meaningful [21,26,19]. We refer to such an object as a contextual
object and embed contextual types, written as ( _ ⊢ _ ), into Beluga types using
the “box” syntax. In this example, the LF context, written on the left of ⊢, is
empty, as we consider closed LF objects. As before, the free variables M and M'
are implicitly quantified at the outside. They themselves stand for contextual
objects and have contextual type ( ⊢ tm T). The theorem statements are hence
statements about contextual LF objects and directly correspond to BELUGA types.

The proof begins with a single subgoal whose type is simply the statement 
of the theorem under no assumptions. Since this subgoal has a function type, 
HARPOON will automatically apply the intros action, which introduces assump- 
tions as follows: First, the (implicitly) universally quantified variables M, M’ are 
added to the meta-context. This context collects parameters introduced by uni- 
versal quantifiers. This is in contrast with the computational context, which col- 
lects assumptions introduced by the simple function space. In particular, the 
second phase of the intros action adds the assumptions s : [ ⊢ step M M'] and
h : [ ⊢ halts M'] to the computational context. Observe that since M and M' have
type tm T, intros also adds T to the meta-context, although it is implicit in the
definitions of step and halts and is not visible at all in the theorem statement
(see the meta-context in Fig. 1, step 1).

The proof proceeds by inversion on h. Using the split action, we add the
two new assumptions S : ( ⊢ steps M' M2) and V : ( ⊢ val M2) to the meta-context
Step 1
  Meta-context:
    T  : ( ⊢ tp)
    M  : ( ⊢ tm T)
    M' : ( ⊢ tm T)
  Computational context:
    s : [ ⊢ step M M']
    h : [ ⊢ halts M']
  ------------------------------
  [ ⊢ halts M]
  > split h

Step 2
  Meta-context:
    T  : ( ⊢ tp)
    M  : ( ⊢ tm T)
    M' : ( ⊢ tm T)
    M2 : ( ⊢ tm T)
    S  : ( ⊢ steps M' M2)
    V  : ( ⊢ val M2)
  Computational context:
    s : [ ⊢ step M M']
    h : [ ⊢ halts M']
  ------------------------------
  [ ⊢ halts M]
  > unbox s as S'

Step 3
  Meta-context:
    T  : ( ⊢ tp)
    M  : ( ⊢ tm T)
    M' : ( ⊢ tm T)
    M2 : ( ⊢ tm T)
    S  : ( ⊢ steps M' M2)
    V  : ( ⊢ val M2)
    S' : ( ⊢ step M M')
  Computational context:
    s : [ ⊢ step M M']
    h : [ ⊢ halts M']
  ------------------------------
  [ ⊢ halts M]
  > solve [ ⊢ halts/m (next S' S) V]


Fig. 1. Interactive session of the proof for the halts_step lemma.


(see Fig. 1, step 1). To build a proof for [ ⊢ halts M], we need to show that
there is a step from M to some value M2. To build such a derivation, we
first use the unbox action on the computation-level assumption s to obtain an
assumption S' in the meta-context which is accessible to the LF layer (inside a
box) (see Fig. 1, step 2). Finally, we can finish the proof by supplying the term
[ ⊢ halts/m (next S' S) V] with the solve action (see Fig. 1, step 3). This is
similar to the exact tactic in Coq.

The resulting proof script is given below. Assertions are written in boldface 
and curly braces denote new scopes, listing the full meta-context and the full 
computational context. Using an erasure we can then generate a translated pro- 
gram in the external syntax, i.e. the syntax a user would use when implementing 
the proof directly, rather than the internal syntax. It is hence much more com- 
pact than the actual proof script. This program can then be seamlessly combined 
with hand-written BELUGA programs and can also be independently type-checked.


Theorem halts_step : [ ⊢ step M M'] → [ ⊢ halts M'] → [ ⊢ halts M]


Proof Script

intros
{ T : ( ⊢ tp), M : ( ⊢ tm T), M' : ( ⊢ tm T)
| s : [ ⊢ step M M'], h : [ ⊢ halts M']
; split h as
  case halts/m:
  { T : ( ⊢ tp), M : ( ⊢ tm T), M' : ( ⊢ tm T),
    M2 : ( ⊢ tm T), S : ( ⊢ steps M' M2), V : ( ⊢ val M2)
  | s : [ ⊢ step M M'], h : [ ⊢ halts M']
  ; by s as S' unboxed
  ; solve [ ⊢ halts/m (next S' S) V]
  }
}


Erased program (external syntax)

fn s => fn h =>
let [ ⊢ halts/m S V] = h in
let [ ⊢ S'] = s in
[ ⊢ halts/m (next S' S) V]
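
For readers more familiar with Lean than with BELUGA, the shape of this proof can be mirrored in a rough analogue. This is only a sketch under strong assumptions: Lean has no built-in HOAS support, so the term type Tm, the single-step relation Step, and the value predicate Val are left abstract rather than defined as in Sec. 2.1, and all names below are hypothetical.

```lean
-- Multi-step closure of an abstract single-step relation.
inductive Steps {Tm : Type} (Step : Tm → Tm → Prop) : Tm → Tm → Prop where
  | refl (m : Tm) : Steps Step m m
  | next {m m' n : Tm} : Step m m' → Steps Step m' n → Steps Step m n

-- A term halts if it steps to a value.
inductive Halts {Tm : Type} (Step : Tm → Tm → Prop) (Val : Tm → Prop) :
    Tm → Prop where
  | mk {m v : Tm} : Val v → Steps Step m v → Halts Step Val m

-- Analogue of halts_step: the match plays the role of `split h`, and the
-- right-hand side is the term supplied to `solve`.
theorem halts_step {Tm : Type} {Step : Tm → Tm → Prop} {Val : Tm → Prop}
    {m m' : Tm} (s : Step m m') (h : Halts Step Val m') : Halts Step Val m :=
  match h with
  | Halts.mk hv hs => Halts.mk hv (Steps.next s hs)
```

There is no counterpart to the unbox action here, since this encoding does not distinguish a computation level from an LF level.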


2.3 Setup continued: reducibility 


We now consider one of the key lemmas in the weak normalization proof, called
the backwards closed lemma, i.e. if M' is reducible at some type T and M steps to
M', then M is also reducible at T. We begin by defining the set of terms reducible
at a type T. All reducible terms are required to halt, and reducible terms at an
arrow type are required to produce reducible output given reducible input.
Concretely, a term M is reducible at type (arr T1 T2) if, for all terms N : tm T1
where N is reducible at type T1, (app M N) is reducible at type T2. Reducibility
cannot be directly encoded on the LF layer, as it is not merely describing the
syntax of an expression or derivation. Instead, we encode the set of reducible
terms using the stratified type Reduce which is recursively defined on the type T
in BELUGA (see [16]). Note that we write { } for explicit universal quantification
over contextual objects.


stratified Reduce : {T : ( ⊢ tp)} [ ⊢ tm T] → ctype =
  | Unit : [ ⊢ halts M] → Reduce [ ⊢ unit] [ ⊢ M]
  | Arr : [ ⊢ halts M]
      → ({N : ( ⊢ tm T1)} Reduce [ ⊢ T1] [ ⊢ N] → Reduce [ ⊢ T2] [ ⊢ app M N])
      → Reduce [ ⊢ arr T1 T2] [ ⊢ M];


2.4 Backwards Closed Property: msplit, suffices, and by 


We can now state the backwards closed lemma formally as follows: if M’ is re- 
ducible at some type T and M steps to M’, then M is also reducible at T. We prove 
this lemma by induction on T. This is specified by referring to the position of 
the induction variable in the statement. 


Name of theorem: bwd_closed
Statement of theorem:
  {T : ( ⊢ tp)} {M : ( ⊢ tm T)} {M' : ( ⊢ tm T)}
  [ ⊢ step M M'] → Reduce [ ⊢ T] [ ⊢ M'] → Reduce [ ⊢ T] [ ⊢ M]
Induction order: 1


After HARPOON automatically introduces the metavariables T, M, and M' together
with the assumptions s : [ ⊢ step M M'] and r : Reduce [ ⊢ T] [ ⊢ M'], we
use msplit T to split the proof into two cases (see Fig. 2, step 1). Whereas split
performs case analysis on a BELUGA type, msplit considers the cases for a
(contextual) LF type. In reality, msplit is implemented in terms of the split action.

The case for T = unit is straightforward (see Fig. 2, steps 2 and 3). First,
we use the split action to invert the premise r : Reduce [ ⊢ unit] [ ⊢ M']. Then,
we use the by action to invoke the halts_step lemma (see Sec. 2.2) to obtain an
assumption h : [ ⊢ halts M]. We solve this case by supplying the term Unit h
(see Fig. 2, step 3).

In the case for T = arr T1 T2, we begin similarly by inversion on r using
the split action (see Fig. 3, step 4). We observe that the goal type is
Reduce [ ⊢ arr T1 T2] [ ⊢ M], which can be produced by using the Arr constructor
if we can construct a proof for each of the user-specified types, [ ⊢ halts M] and
{N : ( ⊢ tm T1)} Reduce [ ⊢ T1] [ ⊢ N] → Reduce [ ⊢ T2] [ ⊢ app M N]. Such
backwards reasoning is accomplished via the suffices action. The user supplies a
term representing an implication whose conclusion is compatible with the current
goal and proceeds to prove its premises as specified (see Fig. 3, step 5).

Step 1

    Meta-context:
      T  : (⊢ tp)
      M  : (⊢ tm T)
      M’ : (⊢ tm T)
    Computational context:
      s : [⊢ step M M’]
      r : Reduce [⊢ T] [⊢ M’]
    ----------------------------------------
    Reduce [⊢ T] [⊢ M]
    > msplit T

Step 2

    Meta-context:
      M  : (⊢ tm unit)
      M’ : (⊢ tm unit)
    Computational context:
      s : [⊢ step M M’]
      r : Reduce [⊢ unit] [⊢ M’]
    ----------------------------------------
    Reduce [⊢ unit] [⊢ M]
    > split r

Step 3

    Meta-context:
      M  : (⊢ tm unit)
      M’ : (⊢ tm unit)
    Computational context:
      s  : [⊢ step M M’]
      h’ : [⊢ halts M’]
      r  : Reduce [⊢ unit] [⊢ M’]
    ----------------------------------------
    Reduce [⊢ unit] [⊢ M]
    > by halts_step s h’ as h;
      solve Unit h


Fig. 2. Backwards Closed Lemma. Step 1: Case analysis of the type T; Steps 2 and 3:
Base case (T = unit).


To prove the first premise, we apply the halts_step lemma (see Fig. 3 step
6). As for the second premise, HARPOON first automatically introduces the
variable N:(⊢ tm T1) and the assumption r1:Reduce [⊢ T1] [⊢ N], so it remains
to show Reduce [⊢ T2] [⊢ app M N]. We deduce r’:Reduce [⊢ T2] [⊢ app M’ N]
using the assumption rn. Using s:[⊢ step M M’], we build a derivation
s’:[⊢ step (app M N) (app M’ N)] using s_app. Finally, we appeal to the
induction hypothesis. Using the by action, we refer to the recursive call to
complete the proof (see Fig. 3 step 7). The resulting proof script (of around
70 lines) can again be translated into a compact program.
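
As a reading aid, the step case can be laid out as a chain of derived facts in conventional notation. This is only a sketch: the shorthand $M \longrightarrow M'$, $M\ \mathsf{halts}$, and $\mathcal{R}_T(M)$ is introduced here for step, halts, and Reduce and is not notation used by HARPOON itself.

```latex
% Assumptions: s : M \longrightarrow M' and r : \mathcal{R}_{T_1 \to T_2}(M').
\begin{align*}
& h' : M'\ \mathsf{halts}, \quad
  rn : \forall N.\ \mathcal{R}_{T_1}(N) \Rightarrow \mathcal{R}_{T_2}(M'\,N)
  && \text{inversion on } r \\
& h : M\ \mathsf{halts}
  && \text{halts\_step applied to } s \text{ and } h' \\
& N \text{ arbitrary}, \quad r_1 : \mathcal{R}_{T_1}(N)
  && \text{introduced for the second premise} \\
& r' : \mathcal{R}_{T_2}(M'\,N)
  && rn \text{ applied to } N \text{ and } r_1 \\
& s' : M\,N \longrightarrow M'\,N
  && \text{s\_app applied to } s \\
& ih : \mathcal{R}_{T_2}(M\,N)
  && \text{induction hypothesis at } T_2 \text{ on } s' \text{ and } r' \\
& \mathcal{R}_{T_1 \to T_2}(M)
  && \text{Arr applied to } h \text{ and } \lambda N.\,\lambda r_1.\, ih
\end{align*}
```

The appeal to the induction hypothesis is justified because $T_2$ is a proper subterm of $T_1 \to T_2$, the type on which the induction is performed.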

Note that HARPOON allows users to use underscores to stand for arguments
that are uniquely determined (see Fig. 3 step 7). We enforce that
these underscores stand for uniquely determined objects in order to guarantee 
that the contexts and the goal type of every subgoal are closed. This ensures 
modularity: solving one subgoal does not affect any other open subgoals. As a 
consequence, users are not restricted in their proof development. As they would 
on paper, users can work on goals in any order, mix forward and backward 
reasoning, erase wrong parts, and replace them by correct steps. 

Using the explained actions, one can now prove the fundamental lemma and 
the weak normalization theorem. For a more detailed description of this proof
in BELUGA see [5,6].


Additional actions. HARPOON supports some additional features not dis- 
cussed in this paper; see for a complete list 
of actions. In general, these actions add no expressive power, but enable more 
precise expression of a user’s intent. For example, the invert action splits on 
the type of a given term, ensuring that there is a unique case to consider. It is 
implemented simply as the split action followed by an additional check. 
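
The idea that invert is just split plus a uniqueness check can be illustrated with a small Python sketch. Everything here is a toy model and not HARPOON's implementation: `toy_split`, `invert`, and `InvertError` are hypothetical names, and type indices are modelled as plain strings.

```python
from typing import Callable, List


class InvertError(Exception):
    """Raised when the scrutinee does not have exactly one covering case."""


def toy_split(type_index: str) -> List[str]:
    """Toy stand-in for split: list the Reduce constructors that can match.

    Hypothetical model: 'unit' admits only Unit, 'arr ...' admits only Arr,
    and anything else (for example a variable T) admits both.
    """
    if type_index == "unit":
        return ["Unit"]
    if type_index.startswith("arr "):
        return ["Arr"]
    return ["Unit", "Arr"]


def invert(split: Callable[[str], List[str]], type_index: str) -> str:
    """Run split, then perform the additional check that one case remains."""
    cases = split(type_index)
    if len(cases) != 1:
        raise InvertError(
            f"expected exactly 1 case for {type_index!r}, got {len(cases)}"
        )
    return cases[0]
```

In this model, inverting at `"unit"` or `"arr T1 T2"` succeeds, while inverting at an undetermined index such as `"T"` is rejected because two cases would have to be considered.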


Step 4

    Meta-context:
      T1 : (⊢ tp)
      T2 : (⊢ tp)
      M  : (⊢ tm (arr T1 T2))
      M’ : (⊢ tm (arr T1 T2))
    Computational context:
      s : [⊢ step M M’]
      r : Reduce [⊢ arr T1 T2] [⊢ M’]
    ----------------------------------------
    Reduce [⊢ arr T1 T2] [⊢ M]
    > split r

Step 5

    Meta-context:
      T1 : (⊢ tp)
      T2 : (⊢ tp)
      M  : (⊢ tm (arr T1 T2))
      M’ : (⊢ tm (arr T1 T2))
    Computational context:
      s  : [⊢ step M M’]
      rn : {N : (⊢ tm T1)} Reduce [⊢ T1] [⊢ N]
             → Reduce [⊢ T2] [⊢ app M’ N]
      h’ : [⊢ halts M’]
      r  : Reduce [⊢ arr T1 T2] [⊢ M’]
    ----------------------------------------
    Reduce [⊢ arr T1 T2] [⊢ M]
    > suffices by Arr toshow
        [⊢ halts M],
        {N : (⊢ tm T1)} Reduce [⊢ T1] [⊢ N]
          → Reduce [⊢ T2] [⊢ app M N]

Step 6

    Meta-context:
      T1 : (⊢ tp)
      T2 : (⊢ tp)
      M  : (⊢ tm (arr T1 T2))
      M’ : (⊢ tm (arr T1 T2))
    Computational context:
      s  : [⊢ step M M’]
      rn : {N : (⊢ tm T1)} Reduce [⊢ T1] [⊢ N]
             → Reduce [⊢ T2] [⊢ app M’ N]
      h’ : [⊢ halts M’]
      r  : Reduce [⊢ arr T1 T2] [⊢ M’]
    ----------------------------------------
    [⊢ halts M]
    > by halts_step s h’ as h

Step 7

    Meta-context:
      T1 : (⊢ tp)
      T2 : (⊢ tp)
      M  : (⊢ tm (arr T1 T2))
      M’ : (⊢ tm (arr T1 T2))
      N  : (⊢ tm T1)
    Computational context:
      s  : [⊢ step M M’]
      rn : {N : (⊢ tm T1)} Reduce [⊢ T1] [⊢ N]
             → Reduce [⊢ T2] [⊢ app M’ N]
      h’ : [⊢ halts M’]
      r  : Reduce [⊢ arr T1 T2] [⊢ M’]
      r1 : Reduce [⊢ T1] [⊢ N]
    ----------------------------------------
    Reduce [⊢ T2] [⊢ app M N]
    > by (rn [⊢ N] r1) as r’;
      unbox s as S;
      by (bwd_closed _ _ _ [⊢ s_app S] r’) as ih


Fig. 3. Backwards Closed Lemma: Step Case


3 Implementation of Harpoon


HARPOON is a front end that allows users to construct a proof for a theorem
statement represented as a BELUGA type. Types in BELUGA include universal
quantification over contextual types (dependent function space, written with
curly braces), implications (simple function space), boxed contextual types, and
stratified/recursive types (written as c C⃗, where C stands for a contextual
object). In addition, BELUGA supports quantification over LF contexts and even
LF substitutions relating two LF contexts. We omit these below for simplicity,
although they are also supported in HARPOON. In essence, BELUGA types
correspond to statements in first-order logic over a domain consisting of
contextual objects, LF contexts, and LF substitutions. We can view c C⃗ and
[Ψ ⊢ A] as atomic propositions.


Types         τ ::= c C⃗ | [Ψ ⊢ A] | {X : (Ψ ⊢ A)} τ | τ₁ → τ₂
Meta-Context  Δ ::= · | Δ, X : (Ψ ⊢ A)
Context       Γ ::= · | Γ, x : τ


Users construct a natural deduction proof for a theorem statement where
Γ, the computation context, contains hypotheses introduced from the simple
function space and where Δ, the meta-context, holds parameters introduced
from the universal quantifier (curly-brace syntax) or by lifting an assumption
[Ψ ⊢ A] from Γ (box-elimination rule).


A subgoal in HARPOON is a typed hole in the proof that remains to be filled
by the user. Such a hole is represented by a subgoal variable, the type of which is a
contextual type (Δ; Γ ⊢ τ) that captures the typechecking state at the point the
variable occurs [19,3]: it remains to construct a proof for τ with the parameters
from Δ and the assumptions from Γ. Subgoal variables in the proof script are
collected into a subgoal context and substitution of subgoal variables is
type-preserving [8]. Interactive actions are implemented with subgoal substitutions,
so the correctness of interactive proof refinement is a consequence of the subgoal
substitution property. Note that a subgoal’s type cannot itself contain subgoals:
the subgoal type must be fully determined, so solving one subgoal cannot affect
any other subgoal. Furthermore, subgoal variables may be introduced only in
positions where we must construct a normal term (written e); these are terms
that we must check against a given type. This given type becomes part of the
subgoal’s type. Subgoal variables thus stand in contrast with ordinary variables,
which are neutral terms (written i). (See [42616] for examples of this so-called
bi-directional characterization of normal and neutral proof terms in BELUGA.)
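
The role of the subgoal context can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not HARPOON's data structures: `Goal`, `Hole`, `Node`, `substitute`, and `refine` are hypothetical names, types are plain strings, and no typechecking is performed. It only shows that replacing one hole leaves the recorded types of all other holes untouched, because each goal is closed.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Goal:
    """A closed subgoal type (Delta; Gamma |- tau), modelled with strings."""
    meta_ctx: Tuple[str, ...]   # Delta: parameters
    comp_ctx: Tuple[str, ...]   # Gamma: assumptions
    goal_type: str              # tau: what remains to be proved


@dataclass(frozen=True)
class Hole:
    """A subgoal variable occurring in the proof script."""
    name: str


@dataclass(frozen=True)
class Node:
    """A proof script fragment with zero or more sub-scripts."""
    label: str
    children: Tuple["Script", ...] = ()


Script = Union[Hole, Node]


def substitute(script: Script, name: str, fragment: Script) -> Script:
    """Replace every occurrence of the hole `name` by `fragment`."""
    if isinstance(script, Hole):
        return fragment if script.name == name else script
    return Node(script.label,
                tuple(substitute(c, name, fragment) for c in script.children))


def refine(script: Script, subgoals: Dict[str, Goal], name: str,
           fragment: Script, new_subgoals: Dict[str, Goal]):
    """Eliminate subgoal `name`; the fragment may introduce new subgoals."""
    if name not in subgoals:
        raise KeyError(f"unknown subgoal {name}")
    remaining = {k: v for k, v in subgoals.items() if k != name}
    remaining.update(new_subgoals)
    return substitute(script, name, fragment), remaining


# Replaying steps 1 to 3 of the base case in this toy model.
script: Script = Hole("?g")
subgoals = {"?g": Goal(("T", "M", "M'"), ("s", "r"), "Reduce [|- T] [|- M]")}
script, subgoals = refine(
    script, subgoals, "?g",
    Node("msplit T", (Hole("?unit"), Hole("?arr"))),
    {"?unit": Goal(("M", "M'"), ("s", "r"), "Reduce [|- unit] [|- M]"),
     "?arr": Goal(("T1", "T2", "M", "M'"), ("s", "r"),
                  "Reduce [|- arr T1 T2] [|- M]")},
)
arr_goal_before = subgoals["?arr"]
script, subgoals = refine(script, subgoals, "?unit", Node("solve Unit h"), {})
```

After the last call only `?arr` remains open, and its recorded goal is identical to what it was before `?unit` was solved.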


An action is executed on a subgoal to eliminate it, while possibly introducing 
new subgoals. Actions emphasize the bi-directional nature of interactive proof 
construction: some demand normal terms e and others demand neutral terms i. 
To execute an action, the system synthesizes a proof script fragment from it, and 
substitutes that fragment for the current subgoal. Any subgoal variables present 
in the fragment become part of the subgoal context, and the user will have to 
solve them later. When no subgoals remain, the proof script is closed and can be 
translated straightforwardly to a BELUGA program in internal (fully elaborated) 
syntax. We employ an erasure to display the program to the user. These are the 
essential actions for proof development, omitting our so-called “administrative” 
actions (such as undo): 


Actions a ::= intros | solve e | by i as x | unbox i as X | split i | suffices i by τ⃗


intros introduces all assumptions from function types in the current goal;
solve closes the current subgoal with a given normal term, introducing no
new subgoals. This action trivially makes HARPOON complete, as a full BELUGA
program could be given via solve to eliminate the initial subgoal of any proof.
The action by enables introducing an intermediate result, often from a lemma or
an induction hypothesis, demanding a neutral term i and binding it to a given
name; unbox is the same as by, but it binds the result as a variable in the
meta-context; split considers a covering set of cases for a neutral term
(typically a variable) and generates possible induction hypotheses based on the
specified induction order (for details on coverage, see [24]); suffices allows
programmers to reason backwards by supplying a neutral term i of function type
and the types τ⃗ of arguments to construct for this function.
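
How an action turns one subgoal into a script fragment plus new subgoals can be sketched in Python for three of the actions. This is an illustrative model only: the class names, the `execute` function, and the generated hypothesis names `x0`, `x1`, ... are assumptions of this sketch, only simple function types are modelled, and the type of the term given to `by` is supplied by the caller because the sketch performs no type synthesis.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Arrow:
    dom: "Ty"
    cod: "Ty"


Ty = Union[Atom, Arrow]


@dataclass(frozen=True)
class Goal:
    comp_ctx: Tuple[Tuple[str, Ty], ...]  # Gamma: named assumptions
    goal_type: Ty                         # tau


@dataclass(frozen=True)
class Intros:
    pass


@dataclass(frozen=True)
class Solve:
    term: str


@dataclass(frozen=True)
class By:
    term: str        # neutral term i, kept as text in this sketch
    result_type: Ty  # assumed to be the type of i (no synthesis here)
    name: str        # name x bound to the result


Action = Union[Intros, Solve, By]


def execute(action: Action, goal: Goal) -> Tuple[str, List[Goal]]:
    """Return a script fragment and the subgoals that the fragment leaves open."""
    if isinstance(action, Intros):
        ctx = list(goal.comp_ctx)
        tau = goal.goal_type
        while isinstance(tau, Arrow):      # peel off every leading premise
            ctx.append((f"x{len(ctx)}", tau.dom))
            tau = tau.cod
        return "intros", [Goal(tuple(ctx), tau)]
    if isinstance(action, Solve):          # closes the goal, no new subgoals
        return f"solve {action.term}", []
    if isinstance(action, By):             # same goal, one more assumption
        extended = goal.comp_ctx + ((action.name, action.result_type),)
        return (f"by {action.term} as {action.name}",
                [Goal(extended, goal.goal_type)])
    raise TypeError(f"unsupported action: {action!r}")


# A goal of type A -> B -> C, refined by intros and then extended with by.
start = Goal((), Arrow(Atom("A"), Arrow(Atom("B"), Atom("C"))))
fragment, after_intros = execute(Intros(), start)
_, after_by = execute(By("lemma x0", Atom("D"), "d"), after_intros[0])
closing_fragment, after_solve = execute(Solve("e"), after_by[0])
```

Here `intros` yields a single new goal `C` with two assumptions, `by` keeps the goal but adds the assumption `d`, and `solve` leaves no subgoals, mirroring the description above.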

4 Empirical evaluation of Harpoon


We give a summary of representative case studies that we replayed using
HARPOON in Table 1. In porting these proofs to HARPOON, we use solve e only when
e is atomic, i.e. it describes either a contextual LF term or a constant applied
to all its arguments (either e = M, e = [C], or e = c C⃗ e₁ … eₙ). We list in the
table the number of commands used to complete the proof and what particular
features made the selected case study interesting for testing HARPOON.


Case study                            Main feature tested

MiniML value soundness                Automatic solving of trivial goals
MiniML compilation completeness       Unboxing program variables
STLC type preservation                Automatic solving of trivial goals
STLC type uniqueness [22]             Open term manipulation (contexts,
                                      parameter variables)
STLC weak normalization [6]           Case analysis on LF contexts, substitution
                                      variables, parameter variables, and
                                      inductive and stratified types
STLC strong normalization [1]         Larger development (310 commands), all
                                      forms of case analysis as above
STLC alg. equality completeness [6]   Larger development (180 commands), all
                                      forms of case analysis as above


Table 1. Summary of proofs ported to HARPOON from BELUGA.


The first four examples proceed by straightforward induction, but the remaining
examples are less direct since they feature logical relations. The STLC strong
normalization and algorithmic equality completeness examples are larger
developments, totalling 38 and 26 theorems respectively. Crucially, these case
studies make use of BELUGA’s domain-specific abstractions, by splitting on
contexts, reasoning about object-language variables, and exploiting the built-in
equational theory of substitutions. We have since used HARPOON to replay the
meta-theoretic proofs about Standard ML from [18].

This evaluation gives us confidence in the robustness and expressive power 
of HARPOON. 


5 Related work 


There are several approaches to specify and reason about formal systems. 
BELUGA and hence HARPOON belong to the lineage of the Twelf system [20], 
which also implements the logical framework LF. Metatheoretic proofs in Twelf 
are implemented as relations. Totality checking then ensures that these relations 
correspond to actual proofs. As Twelf is limited to proving Π₂ formulas
(“forall-exists” statements), normalization proofs using logical relations
cannot be directly encoded. Although HARPOON’s actions are largely inspired by
the internal actions of Twelf’s (experimental) fully-automated metatheorem
prover [28,27], HARPOON supports user interaction, more expressive theorem
statements, and generation of proof witnesses, in the form of both the generated
proof script and the BELUGA program resulting from translation.

The Abella system [11] also provides an interactive theorem prover for
reasoning about specifications using HOAS. First, its theoretical basis is quite
different from BELUGA’s: Abella’s reasoning logic extends first-order logic with
a ∇ quantifier [12] that is used to express properties about variables. Second,
Abella’s interactive mode provides a fixed set of tactics, similar to the actions we 
describe in this paper. However, these tactics only loosely connect to the actual 
theoretical foundation of Abella and no proof terms are generated as witnesses 
by the Abella system. 

We can also reason about formal systems in general purpose proof assistants 
such as Coq. The general philosophy in such systems is that users should be 
in the position of writing complex domain-specific tactics to facilitate proof 
construction using languages such as LTac [7] or MTac(2) [29,17]. Although
this is an extremely flexible approach, we believe that the tactic-centric view 
often obscures the actual line of reasoning in the proof. The proofs themselves 
can often be illegible and incomprehensible. Further, strong static guarantees 
about interactive proof construction are lacking; for example, dynamic checks 
enforce variable dependencies. In contrast, our goal is to enable mechanized proof 
development in a style close to that of a proof on paper. Thus we provide a fixed 
set of tactics suitable for a wide array of proofs, so users can concentrate on proof 
development instead of tactic development. As such, our work draws inspiration 
from [2] where the authors describe high-level actions within the tutorial proof 
checker Tutch. Our work extends and adapts this view to the mechanization of 
inductive metatheoretic proofs based on HOAS representations. 


6 Conclusion 


We have presented HARPOON, an interactive command-driven front-end of BEL- 
UGA for mechanizing meta-theoretic proofs based on high-level actions. The 
sequence of interactive actions is elaborated into a proof script behind the 
scenes that represents an assertion-level proof. Last, proof scripts can soundly be 
translated to BELUGA programs. We have evaluated HARPOON on several case- 
studies, ranging from purely syntactic arguments to proofs by logical relations. 
Our experience is that HARPOON lowers the entry barrier for users to develop 
meta-theoretic proofs about HOAS encodings. 

In the future, we aim to extend HARPOON with additional high-level actions 
that support further automation. A natural first step is to support an action 
trivial which would attempt to automatically close an open sub-goal. 